2024

  • LLM Research Papers: The 2024 List
    I want to share my running bookmark list of many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It's just a list, but maybe it will come in handy for those who are interested in finding some gems to read for the holidays.
  • Understanding Multimodal LLMs An Introduction to the Main Techniques and Latest Models
    There has been a lot of new research on the multimodal LLM front, including the latest Llama 3.2 vision models, which employ diverse architectural strategies to integrate various data types like text and images. For instance, The decoder-only method uses a single stack of decoder blocks to process all modalities sequentially. On the other hand, cross-attention methods (for example, used in Llama 3.2) involve separate encoders for different modalities with a cross-attention layer that allows these encoders to interact. This article explains how these different types of multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks to compare their approaches.
  • Building A GPT-Style LLM Classifier From Scratch Finetuning a GPT Model for Spam Classification
    This article shows you how to transform pretrained large language models (LLMs) into strong text classifiers. But why focus on classification? First, finetuning a pretrained model for classification offers a gentle yet effective introduction to model finetuning. Second, many real-world and business challenges revolve around text classification: spam detection, sentiment analysis, customer feedback categorization, topic labeling, and more.
  • Building LLMs from the Ground Up: A 3-hour Coding Workshop
    This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. We will kick off this tutorial with an introduction to LLMs, recent milestones, and their use cases. Then, we will code a small GPT-like LLM, including its data input pipeline, core architecture components, and pretraining code ourselves. After understanding how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.
  • New LLM Pre-training and Post-training Paradigms -- A Look at How Moderns LLMs Are Trained
    There are hundreds of LLM papers each month proposing new techniques and approaches. However, one of the best ways to see what actually works well in practice is to look at the pre-training and post-training pipelines of the most recent state-of-the-art models. Luckily, four major new LLMs have been released in the last months, accompanied by relatively detailed technical reports. In this article, I focus on the pre-training and post-training pipelines of the following models: Alibaba's Qwen 2, Apple Intelligence Foundation Language Models, Google's Gemma 2, Meta AI's Llama 3.1.
  • Instruction Pretraining LLMs -- The Latest Research in Instruction Finetuning
    This article covers a new, cost-effective method for generating data for instruction finetuning LLMs; instruction finetuning from scratch; pretraining LLMs with instruction data; and an overview of what's new in Gemma 2.
  • Developing an LLM: Building, Training, Finetuning A Deep Dive into the Lifecycle of LLM Development
    This is an overview of the LLM development process. This one-hour talk focuses on the essential three stages of developing an LLM: coding the architecture, implementing pretraining, and fine-tuning the LLM. Lastly, we also discuss the main ways LLMs are evaluated, along with the caveats of each method.
  • LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments? Discussing the Latest Model Releases and AI Research in May 2024
    This article covers three new papers related to instruction finetuning and parameter-efficient finetuning with LoRA in large language models (LLMs). I work with these methods on a daily basis, so it's always exciting to see new research that provides practical insights.
  • How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? Discussing the Latest Model Releases and AI Research in April 2024
    What a month! We had four major open LLM releases: Mixtral, Meta AI's Llama 3, Microsoft's Phi-3, and Apple's OpenELM. In my new article, I review and discuss all four of these major transformer-based LLM model releases, followed by new research on reinforcement learning with human feedback methods for instruction finetuning using PPO and DPO algorithms.
  • Using and Finetuning Pretrained Transformers
    What are the different ways to use and finetune pretrained large language models (LLMs)? The three most common ways to use and finetune pretrained LLMs include a feature-based approach, in-context prompting, and updating a subset of the model parameters. First, most pretrained LLMs or language transformers can be utilized without the need for further finetuning. For instance, we can employ a feature-based method to train a new downstream model, such as a linear classifier, using embeddings generated by a pretrained transformer. Second, we can showcase examples of a new task within the input itself, which means we can directly exhibit the expected outcomes without requiring any updates or learning from the model. This concept is also known as prompting. Finally, it’s also possible to finetune all or just a small number of parameters to achieve the desired outcomes. This article discusses these types of approaches in greater depth
  • Tips for LLM Pretraining and Evaluating Reward Models Research Papers in March 2024
    It's another month in AI research, and it's hard to pick favorites. This month, I am going over a paper that discusses strategies for the continued pretraining of LLMs, followed by a discussion of reward modeling used in reinforcement learning with human feedback (a popular LLM alignment method), along with a new benchmark. Continued pretraining for LLMs is an important topic because it allows us to update existing LLMs, for instance, ensuring that these models remain up-to-date with the latest information and trends. Also, it allows us to adapt them to new target domains without having them to retrain from scratch. Reward modeling is important because it allows us to align LLMs more closely with human preferences and, to some extent, helps with safety. But beyond human preference optimization, it also provides a mechanism for learning and adapting LLMs to complex tasks by providing instruction-output examples where explicit programming of correct behavior is challenging or impractical.
  • Research Papers in February 2024 — A LoRA Successor, Small Finetuned LLMs Vs Generalist LLMs, and Transparent LLM Research
    Once again, this has been an exciting month in AI research. This month, I'm covering two new openly available LLMs, insights into small finetuned LLMs, and a new parameter-efficient LLM finetuning technique. The two LLMs mentioned above stand out for several reasons. One LLM (OLMo) is completely open source, meaning that everything from the training code to the dataset to the log files is openly shared. The other LLM (Gemma) also comes with openly available weights but achieves state-of-the-art performance on several benchmarks and outperforms popular LLMs of similar size, such as Llama 2 7B and Mistral 7B, by a large margin.
  • Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch
    Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model (for example, an LLM or vision transformer) to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters. In this article, we will take a look at both LoRA and DoRA, which is a new promising alternative to LoRA.

2023

  • Optimizing LLMs From a Dataset Perspective
    This article focuses on improving the modeling performance of LLMs by finetuning them using carefully curated datasets. Specifically, this article highlights strategies that involve modifying, utilizing, or manipulating the datasets for instruction-based finetuning rather than altering the model architecture or training algorithms (the latter will be topics of a future article). This article will also explain how you can prepare your own datasets to finetune open-source LLMs.
  • The NeurIPS 2023 LLM Efficiency Challenge Starter Guide
    Large language models (LLMs) offer one of the most interesting opportunities for developing more efficient training methods. A few weeks ago, the NeurIPS 2023 LLM Efficiency Challenge launched to focus on efficient LLM finetuning, and this guide is a short walkthrough explaining how to participate in this competition. This article covers everything you need to know, from setting up the coding environment to making the first submission.
  • Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch
    Peak memory consumption is a common bottleneck when training deep learning models such as vision transformers and LLMs. This article provides a series of techniques that can lower memory consumption by approximately 20x without sacrificing modeling performance and prediction accuracy.
  • Finetuning Falcon LLMs More Efficiently With LoRA and Adapters
    Finetuning allows us to adapt pretrained LLMs in a cost-efficient manner. But which method should we use? This article compares different parameter-efficient finetuning methods for the latest top-performing open-source LLM, Falcon. Using parameter-efficient finetuning methods outlined in this article, it's possible to finetune an LLM in 1 hour on a single GPU instead of a day on 6 GPUs.
  • Accelerating Large Language Models with Mixed-Precision Techniques
    Training and using large language models (LLMs) is expensive due to their large compute requirements and memory footprints. This article will explore how leveraging lower-precision formats can enhance training and inference speeds up to 3x without compromising model accuracy.
  • Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)
    Pretrained large language models are often referred to as foundation models for a good reason: they perform well on various tasks, and we can use them as a foundation for finetuning on a target task. As an alternative to updating all layers, which is very expensive, parameter-efficient methods such as prefix tuning and adapters have been developed. Let's talk about one of the most popular parameter-efficient finetuning techniques: Low-rank adaptation (LoRA). What is LoRA? How does it work? And how does it compare to the other popular finetuning approaches? Let's answer all these questions in this article!
  • Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters
    In the rapidly evolving field of artificial intelligence, utilizing large language models in an efficient and effective manner has become increasingly important. Parameter-efficient finetuning stands at the forefront of this pursuit, allowing researchers and practitioners to reuse pretrained models while minimizing their computational and resource footprints. This article explains the broad concept of finetuning and discusses popular parameter-efficient alternatives like prefix tuning and adapters. Finally, we will look at the recent LLaMA-Adapter method and see how we can use it in practice.
  • Finetuning Large Language Models On A Single GPU Using Gradient Accumulation
    Previously, I shared an article using multi-GPU training strategies to speed up the finetuning of large language models. Several of these strategies include mechanisms such as model or tensor sharding that distributes the model weights and computations across different devices to work around GPU memory limitations. However, many of us don't have access to multi-GPU resources. So, this article illustrates a simple technique that works as a great workaround to train models with larger batch sizes when GPU memory is a concern: gradient accumulation.
  • Keeping Up With AI Research And News
    When it comes to productivity workflows, there are a lot of things I'd love to share. However, the one topic many people ask me about is how I keep up with machine learning and AI at large, and how I find interesting papers.
  • Some Techniques To Make Your PyTorch Models Train (Much) Faster
    This blog post outlines techniques for improving the training performance of your PyTorch model without compromising its accuracy. To do so, we will wrap a PyTorch model in a LightningModule and use the Trainer class to enable various training optimizations. By changing only a few lines of code, we can reduce the training time on a single GPU from 22.53 minutes to 2.75 minutes while maintaining the model's prediction accuracy. Yes, that's a 8x performance boost!
  • Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch
    In this article, we are going to understand how self-attention works from scratch. This means we will code it ourselves one step at a time. Since its introduction via the original transformer paper, self-attention has become a cornerstone of many state-of-the-art deep learning models, particularly in the field of Natural Language Processing. Since self-attention is now everywhere, it's important to understand how it works.
  • Understanding Large Language Models -- A Transformative Reading List
    Since transformers have such a big impact on everyone's research agenda, I wanted to flesh out a short reading list for machine learning researchers and practitioners getting started with large language models.
  • What Are the Different Approaches for Detecting Content Generated by LLMs Such As ChatGPT? And How Do They Work and Differ?
    Since the release of the AI Classifier by OpenAI made big waves yesterday, I wanted to share a few details about the different approaches for detecting AI-generated text. This article briefly outlines four approaches to identifying AI-generated contents.
  • Comparing Different Automatic Image Augmentation Methods in PyTorch
    Data augmentation is a key tool in reducing overfitting, whether it's for images or text. This article compares three Auto Image Data Augmentation techniques in PyTorch: AutoAugment, RandAugment, and TrivialAugment.
  • Curated Resources and Trustworthy Experts: The Key Ingredients for Finding Accurate Answers to Technical Questions in the Future
    Conversational chat bots such as ChatGPT probably will not be able replace traditional search engines and expert knowledge anytime soon. With the vast amount of misinformation available on the internet, the ability to distinguish between credible and unreliable sources remains challenging and crucial.
  • Training an XGBoost Classifier Using Cloud GPUs Without Worrying About Infrastructure
    Imagine you want to quickly train a few machine learning or deep learning models on the cloud but don't want to deal with cloud infrastructure. This short article explains how we can get our code up and running in seconds using the open source lightning library.
  • Open Source Highlights 2022 for Machine Learning & AI
    Recently, I shared the top 10 papers that I read in 2022. As a follow-up, I am compiling a list of my favorite 10 open-source releases that I discovered, used, or contributed to in 2022.
  • Influential Machine Learning Papers Of 2022
    Every day brings something new and exciting to the world of machine learning and AI, from the latest developments and breakthroughs in the field to emerging trends and challenges. To mark the start of the new year, below is a short review of the top ten papers I've read in 2022.

2022

  • Ahead Of AI, And What's Next?
    About monthly machine learning musings, and other things I am currently workin on ...
  • A Short Chronology Of Deep Learning For Tabular Data
    Occasionally, I share research papers proposing new deep learning approaches for tabular data on social media, which is typically an excellent discussion starter. Often, people ask for additional methods or counterexamples. So, with this short post, I aim to briefly summarize the major papers on deep tabular learning I am currently aware of. However, I want to emphasize that no matter how interesting or promising deep tabular methods look, I still recommend using a conventional machine learning method as a baseline. There is a reason why I cover conventional machine learning before deep learning in my books.
  • No, We Don't Have to Choose Batch Sizes As Powers Of 2
    Regarding neural network training, I think we are all guilty of doing this: we choose our batch sizes as powers of 2, that is, 64, 128, 256, 512, 1024, and so forth. There are some valid theoretical justifications for this, but how does it pan out in practice? We had some discussions about that in the last couple of days, and here I want to write down some of the take-aways so I can reference them in the future. I hope you'll find this helpful as well!
  • Sharing Deep Learning Research Models with Lightning Part 2: Leveraging the Cloud
    In this article, we will take deploy a Super Resolution App on the cloud using lightning.ai. The primary goal here is to see how easy it is to create and share a research demo. However, the cloud is for more than just model sharing: we will also learn how we can tap into additional GPU resources for model training.
  • Sharing Deep Learning Research Models with Lightning Part 1: Building A Super Resolution App
    In this post, we will build a Lightning App. Why? Because it is 2022, and it is time to explore a more modern take on interacting with, presenting, and sharing our deep learning models. We are going to tackle this in three parts. In this first part, we will learn what a Lightning App is and how we build a Super Resolution GAN demo.
  • Taking Datasets, DataLoaders, and PyTorch’s New DataPipes for a Spin
    The PyTorch team recently announced TorchData, a prototype library focused on implementing composable and reusable data loading utilities for PyTorch. In particular, the TorchData library is centered around DataPipes, which are meant to be a DataLoader-compatible replacement for the existing Dataset class.
  • Running PyTorch on the M1 GPU
    Today, PyTorch officially introduced GPU support for Apple's ARM M1 chips. This is an exciting day for Mac users out there, so I spent a few minutes trying it out in practice. In this short blog post, I will summarize my experience and thoughts with the M1 chip for deep learning tasks.
  • Creating Confidence Intervals for Machine Learning Classifiers
    Developing good predictive models hinges upon accurate performance evaluation and comparisons. However, when evaluating machine learning models, we typically have to work around many constraints, including limited data, independence violations, and sampling biases. Confidence intervals are no silver bullet, but at the very least, they can offer an additional glimpse into the uncertainty of the reported accuracy and performance of a model. This article outlines different methods for creating confidence intervals for machine learning models. Note that these methods also apply to deep learning.
  • Losses Learned -- Optimizing Negative Log-Likelihood and Cross-Entropy in PyTorch (Part 1)
    The cross-entropy loss is our go-to loss for training deep learning-based classifiers. In this article, I am giving you a quick tour of how we usually compute the cross-entropy loss and how we compute it in PyTorch. There are two parts to it, and here we will look at a binary classification context first. You may wonder why bother writing this article; computing the cross-entropy loss should be relatively straightforward!? Yes and no. We can compute the cross-entropy loss in one line of code, but there's a common gotcha due to numerical optimizations under the hood. (And yes, when I am not careful, I sometimes make this mistake, too.) So, in this article, let me tell you a bit about deep learning jargon, improving numerical performance, and what could go wrong.
  • TorchMetrics -- How do we use it, and what's the difference between .update() and .forward()?
    TorchMetrics is a really nice and convenient library that lets us compute the performance of models in an iterative fashion. It's designed with PyTorch (and PyTorch Lightning) in mind, but it is a general-purpose library compatible with other libraries and workflows. This iterative computation is useful if we want to track a model during iterative training or evaluation on minibatches (and optionally across on multiple GPUs). In deep learning, that's essentially *all the time*. However, when using TorchMetrics, one common question is whether we should use `.update()` or `.forward()`? (And that's also a question I certainly had when I started using it.). Here's a hands-on example and explanation.
  • Machine Learning with PyTorch and Scikit-Learn -- The *new* Python Machine Learning Book
    Machine Learning with PyTorch and Scikit-Learn has been a long time in the making, and I am excited to finally get to talk about the release of my new book. Initially, this project started as the 4th edition of Python Machine Learning. However, we made so many changes to the book that we thought it deserved a new title to reflect that. So, what's new, you may wonder? In this post, I am excited to tell you all about it.

2021

2020

2019

  • What's New in the 3rd Edition
    A brief summary of what's new in the 3rd edition of Python Machine Learning.
  • My First Year at UW-Madison and a Gallery of Awesome Student Projects
    Not too long ago, in the Summer of 2018, I was super excited to join the Department of Statistics at the University of Wisconsin-Madison after obtaining my Ph.D. after ~5 long and productive years. Now, two semesters later after finals' week, I finally found some quiet days to look back on what's happened since then. In this post, I am sharing a short reflection as well as a some of the exciting projects my students were working on.

2018

  • Model evaluation, model selection, and algorithm selection in machine learning Part IV - Comparing the performance of machine learning models and algorithms using statistical tests and nested cross-validation
    This final article in the series *Model evaluation, model selection, and algorithm selection in machine learning* presents overviews of several statistical hypothesis testing approaches, with applications to machine learning model and algorithm comparisons. This includes statistical tests based on target predictions for independent test sets (the downsides of using a single test set for model comparisons was discussed in previous articles) as well as methods for algorithm comparisons by fitting and evaluating models via cross-validation. Lastly, this article will introduce *nested cross-validation*, which has become a common and recommended a method of choice for algorithm comparisons for small to moderately-sized datasets.
  • Generating Gender-Neutral Face Images with Semi-Adversarial Neural Networks to Enhance Privacy
    I thought that it would be nice to have short and concise summaries of recent projects handy, to share them with a more general audience, including colleagues and students. So, I challenged myself to use fewer than 1000 words without getting distracted by the nitty-gritty details and technical jargon. In this post, I mainly cover some of my recent research in collaboration with the [iPRoBe Lab](http://iprobe.cse.msu.edu) that falls under the broad category of developing approaches to hide specific information in face images. The research discussed in this post is about "maximizing privacy while preserving utility."

2016

  • Model evaluation, model selection, and algorithm selection in machine learning Part III - Cross-validation and hyperparameter tuning
    Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates. In this third part, we will focus on different methods of cross-validation for model evaluation and model selection. We will use these cross-validation techniques to rank models from several hyperparameter configurations and estimate how well they generalize to independent datasets.
  • Model evaluation, model selection, and algorithm selection in machine learning Part II - Bootstrapping and uncertainties
    In this second part of this series, we will look at some advanced techniques for model evaluation and techniques to estimate the uncertainty of our estimated model performance as well as its variance and stability. Then, in the next article, we will shift the focus onto another task that is one of the main pillar of successful, real-world machine learning applications -- Model Selection.
  • Model evaluation, model selection, and algorithm selection in machine learning Part I - The basics
    Machine learning has become a central part of our life -- as consumers, customers, and hopefully as researchers and practitioners! Whether we are applying predictive modeling techniques to our research or business problems, I believe we have one thing in common : We want to make good predictions! Fitting a model to our training data is one thing, but how do we know that it generalizes well to unseen data? How do we know that it doesn't simply memorize the data we fed it and fails to make good predictions on future samples, samples that it hasn't seen before? And how do we select a good model in the first place? Maybe a different learning algorithm could be better-suited for the problem at hand? Model evaluation is certainly not just the end point of our machine learning pipeline.

    Before we handle any data, we want to plan ahead and use techniques that are suited for our purposes. In this article, we will go over a selection of these techniques, and we will see how they fit into the bigger picture, a typical machine learning workflow.

2015

  • Writing 'Python Machine Learning' – A Reflection on a Journey
    It's been about time. I am happy to announce that "Python Machine Learning" was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing "Python Machine Learning" really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.
  • Python, Machine Learning, and Language Wars – A Highly Subjective Point of View
    This has really been quite a journey for me lately. And regarding the frequently asked question “Why did you choose Python for Machine Learning?” I guess it is about time to write my script. In this article, I really don’t mean to tell you why you or anyone else should use Python. But read on if you are interested in my opinion.
  • Single-Layer Neural Networks and Gradient Descent
    This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.
  • Principal Component Analysis in 3 Simple Steps
    Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a “black box”, and we are going to unravel its internals in 3 basic steps.
  • Implementing a Weighted Majority Rule Ensemble Classifier in scikit-learn
    Here, I want to present a simple and conservative approach of implementing a weighted majority rule ensemble classifier in scikit-learn that yielded remarkably good results when I tried it in a kaggle competition. For me personally, kaggle competitions are just a nice way to try out and compare different approaches and ideas -- basically an opportunity to learn in a controlled environment with nice datasets.

2014

  • MusicMood – A Machine Learning Model for Classifying Music by Mood Based on Song Lyrics
    In this article, I want to share my experience with a recent data mining project which probably was one of my most favorite hobby projects so far. It's all about building a classification model that can automatically predict the mood of music based on song lyrics.
  • Turn Your Twitter Timeline into a Word Cloud – using Python
    Last week, I posted some visualizations in context of Happy Rock Song data mining project, and some people were curious about how I created the word clouds. Learn how to create YOUR personal Twitter Timeline!
  • Naive Bayes and Text Classification – Introduction and Theory
    Naive Bayes classifiers, a family of classifiers that are based on the popular Bayes’ probability theorem, are known for creating simple yet well performing models, especially in the fields of document classification and disease prediction. In this first part of a series, we will take a look at the theory of naive Bayes classifiers and introduce the basic concepts of text classification. In following articles, we will implement those concepts to train a naive Bayes spam filter and apply naive Bayes to song classification based on lyrics.
  • Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA
    The focus of this article is to briefly introduce the idea of kernel methods and to implement a Gaussian radius basis function (RBF) kernel that is used to perform nonlinear dimensionality reduction via KBF kernel principal component analysis (kPCA).
  • Predictive modeling, supervised machine learning, and pattern classification — the big picture
    When I was working on my next pattern classification application, I realized that it might be worthwhile to take a step back and look at the big picture of pattern classification in order to put my previous topics into context and to provide and introduction for the future topics that are going to follow.
  • Linear Discriminant Analysis – Bit by Bit
    I received a lot of positive feedback about the step-wise Principal Component Analysis (PCA) implementation. Thus, I decided to write a little follow-up about Linear Discriminant Analysis (LDA) — another useful linear transformation technique. Both LDA and PCA are commonly used dimensionality reduction techniques in statistics, pattern classification, and machine learning applications. By implementing the LDA step-by-step in Python, we will see and understand how it works, and we will compare it to a PCA to see how it differs.
  • Dixon's Q test for outlier identification – A questionable practice
    I recently faced the impossible task to identify outliers in a dataset with very, very small sample sizes and Dixon's Q test caught my attention. Honestly, I am not a big fan of this statistical test, but since Dixon's Q-test is still quite popular in certain scientific fields (e.g., chemistry) that it is important to understand its principles in order to draw your own conclusion of the presented research data that you might stumble upon in research articles or scientific talks.
  • About Feature Scaling and Normalization – and the effect of standardization for machine learning algorithms
    I received a couple of questions in response to my previous article (Entry point: Data) where people asked me why I used Z-score standardization as feature scaling method prior to the PCA. I added additional information to the original article, however, I thought that it might be worthwhile to write a few more lines about this important topic in a separate article.
  • Entry Point Data – Using Python's sci-packages to prepare data for Machine Learning tasks and other data analyses
    In this short tutorial I want to provide a short overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses.
  • Molecular docking, estimating free energies of binding, and AutoDock's semi-empirical force field
    Discussions and questions about methods, approaches, and tools for estimating (relative) binding free energies of protein-ligand complexes are quite popular, and even the simplest tools can be quite tricky to use. Here, I want to briefly summarize the idea of molecular docking, and give a short overview about how we can use AutoDock 4.2's hybrid approach for evaluating binding affinities.
  • An introduction to parallel programming using Python's multiprocessing module – using Python's multiprocessing module
    The default Python interpreter was designed with simplicity in mind and has a thread-safe mechanism, the so-called "GIL" (Global Interpreter Lock). In order to prevent conflicts between threads, it executes only one statement at a time (so-called serial processing, or single-threading). In this introduction to Python's multiprocessing module, we will see how we can spawn multiple subprocesses to avoid some of the GIL's disadvantages and make best use of the multiple cores in our CPU.
  • Kernel density estimation via the Parzen-Rosenblatt window method – explained using Python
    The Parzen-window method (also known as Parzen-Rosenblatt window method) is a widely used non-parametric approach to estimate a probability density function *p(**x**)* for a specific point *p(**x**)* from a sample *p(**x**n)* that doesn't require any knowledge or assumption about the underlying distribution.
  • Numeric matrix manipulation – The cheat sheet for MATLAB, Python NumPy, R, and Julia
    At its core, this article is about a simple cheat sheet for basic operations on numeric matrices, which can be very useful if you working and experimenting with some of the most popular languages that are used for scientific computing, statistics, and data analysis.
  • The key differences between Python 2.7.x and Python 3.x with examples
    Many beginning Python users are wondering with which version of Python they should start. My answer to this question is usually something along the lines 'just go with the version your favorite tutorial was written in, and check out the differences later on.'\ But what if you are starting a new project and have the choice to pick? I would say there is currently no 'right' or 'wrong' as long as both Python 2.7.x and Python 3.x support the libraries that you are planning to use. However, it is worthwhile to have a look at the major differences between those two most popular versions of Python to avoid common pitfalls when writing the code for either one of them, or if you are planning to port your project.
  • 5 simple steps for converting Markdown documents into HTML and adding Python syntax highlighting
    In this little tutorial, I want to show you in 5 simple steps how easy it is to add code syntax highlighting to your blog articles.
  • Creating a table of contents with internal links in IPython Notebooks and Markdown documents
    Many people have asked me how I create the table of contents with internal links for my IPython Notebooks and Markdown documents on GitHub. Well, no (IPython) magic is involved, it is just a little bit of HTML, but I thought it might be worthwhile to write this little how-to tutorial.
  • A Beginner's Guide to Python's Namespaces, Scope Resolution, and the LEGB Rule
    A short tutorial about Python's namespaces and the scope resolution for variable names using the LEGB-rule with little quiz-like exercises.
  • Diving deep into Python – the not-so-obvious language parts
    Some while ago, I started to collect some of the not-so-obvious things I encountered when I was coding in Python. I thought that it was worthwhile sharing them and encourage you to take a brief look at the section-overview and maybe you'll find something that you do not already know - I can guarantee you that it'll likely save you some time at one or the other tricky debugging challenge.
  • Implementing a Principal Component Analysis (PCA) – in Python, step by step
    In this article I want to explain how a Principal Component Analysis (PCA) works by implementing it in Python step by step. At the end we will compare the results to the more convenient Python PCA() classes that are available through the popular matplotlib and scipy libraries and discuss how they differ.
  • Installing Scientific Packages for Python3 on MacOS 10.9 Mavericks
    I just went through some pain (again) when I wanted to install some of Python's scientific libraries on my second Mac. I summarized the setup and installation process for future reference.\ If you encounter any different or additional obstacles let me know, and please feel free to make any suggestions to improve this short walkthrough.
  • A thorough guide to SQLite database operations in Python
    After I wrote the initial teaser article "SQLite - Working with large data sets in Python effectively" about how awesome SQLite databases are via sqlite3 in Python, I wanted to delve a little bit more into the SQLite syntax and provide you with some more hands-on examples.
  • Using OpenEye software for substructure alignments and best-matching low-energy conformer overlays
    This is a quickguide showing how to use OpenEye software command line tools to align target molecules to a query based on substructure matches and how to retrieve the best molecule overlay from two sets of low-energy conformers.

2013

  • Unit testing in Python – Why we want to make it a habit
    Let’s be honest, code testing is everything but a joyful task. However, a good unit testing framework makes this process as smooth as possible. Eventually, testing becomes a regular and continuous process, accompanied by the assurance that our code will operate just as exact and seamlessly as a Swiss clockwork.
  • A short tutorial for decent heat maps in R
    I received many questions from people who want to quickly visualize their data via heat maps - ideally as quickly as possible. This is the major issue of exploratory data analysis, since we often don’t have the time to digest whole books about the particular techniques in different software packages to just get the job done. But once we are happy with our initial results, it might be worthwhile to dig deeper into the topic in order to further customize our plots and maybe even polish them for publication. In this post, my aim is to briefly introduce one of R’s several heat map libraries for a simple data analysis. I chose R, because it is one of the most popular free statistical software packages around. Of course there are many more tools out there to produce similar results (and even in R there are many different packages for heat maps), but I will leave this as an open topic for another time.
  • SQLite – Working with large data sets in Python effectively
    My new project confronted me with the task of screening a massive set of large data files in text format with billions of entries each. I will have to retrieve data repeatedly and frequently in the future. Thus, I was tempted to find a better solution than brute-force scanning through ~20 separate 1-column text files with ~6 billion entries every time line by line.