The blog of a juvenile Geekus biologicus

Analyze a project’s programming languages using github-linguist

github-linguist is a Ruby library and command line tool for detecting the programming languages used in a project. It is used by GitHub to detect the language of a project and to generate language statistics.

Here we run it from the command line to analyze the programming languages used in our own projects.

When I applied to a bioinformatics master’s degree, I had to state which programming languages I have a command of. So here are some quick tips for using github-linguist, as I learned to do for this purpose.




$ gem install github-linguist


$ github-linguist

For instance, on my blog’s source code I get:

65.80%  945252     Jupyter Notebook
17.12%  245876     CSS
14.37%  206405     HTML
1.29%   18478      Less
0.77%   11019      Python
0.43%   6212       Shell
0.17%   2472       Makefile
0.06%   879        JavaScript
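
The report is plain text with three whitespace-separated columns: a percentage, a number that I take to be the size in bytes, and the language name. That makes it easy to post-process. As a minimal sketch, the hypothetical helper `top_language` below (not part of github-linguist) reads such a report on standard input and prints the language with the largest second column. It assumes the layout shown above; the sample input is copied from the first lines of my report.

```shell
#!/usr/bin/env bash
# top_language: hypothetical helper, not part of github-linguist.
# Reads a linguist report on stdin and prints the language whose
# second column (assumed to be a byte count) is the largest.
top_language() {
    awk '/^[0-9]/ {
        bytes = $2
        name = $0
        sub(/^[0-9.]+%[ \t]+[0-9]+[ \t]+/, "", name)   # strip the two numeric columns
        if (bytes + 0 > max) { max = bytes + 0; best = name }
    }
    END { if (best != "") print best }'
}

# Sample input copied from the report above
report='65.80%  945252     Jupyter Notebook
17.12%  245876     CSS
14.37%  206405     HTML'

top=$(printf '%s\n' "$report" | top_language)
echo "$top"
```

Stripping the numeric columns with `sub` rather than printing `$3` keeps multi-word names such as "Jupyter Notebook" intact.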

Let’s use a script to get the result for all my projects:

Let’s assume you have a directory with all your projects, say in ~/Documents/Projects/:

#!/usr/bin/env bash
# Usage: ./ ~/Documents/Projects/
# Recursively search for git repositories in the given directory
# and print the programming languages used in each of them.

# Get the directory to search for git repositories
if [ -z "$1" ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi
DIR="$1"

# Run github-linguist on every .git directory found
# (the unquoted $(find ...) assumes paths without whitespace)
for REPO in $(find "$DIR" -name .git -type d); do
    echo -e "repo: $REPO"
    github-linguist "$REPO/"
    echo -e "\n"
done

$ ./ ~/Documents/Projects/
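
One caveat with looping over `$(find ...)` is that the shell splits the result on whitespace, so a project directory such as `My Project` would be cut in two. Here is a minimal sketch of a whitespace-safe variant, assuming bash (for `read -d ''`) and a `find` that supports `-print0` (GNU and BSD both do). The `list_git_dirs` name and the temporary demo tree are made up for illustration; in real use the `echo` would be followed by the call to github-linguist.

```shell
#!/usr/bin/env bash
# list_git_dirs: hypothetical helper that prints one .git directory per line,
# reading NUL-delimited paths so names containing spaces stay intact.
list_git_dirs() {
    find "$1" -name .git -type d -print0 |
        while IFS= read -r -d '' git_dir; do
            echo "repo: $git_dir"
        done
}

# Demo on a throwaway tree (made-up project names)
demo=$(mktemp -d)
mkdir -p "$demo/My Project/.git" "$demo/other/.git"
found=$(list_git_dirs "$demo" | sort)
echo "$found"
rm -rf "$demo"
```

Each repository is reported on a single line, including the one whose name contains a space.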

This has the disadvantage of printing the result for every repository, including dependencies that are checked out inside a project.

Let’s assume that a project is a git repository root, and that the other git repositories in subdirectories are dependencies:

#!/usr/bin/env bash
# Usage: ./ ~/Documents/Projects/
# Recursively search for git repositories in the given directory
# and print the programming languages used in each of them.
# The search stops at the first repository root found on each path.

recurse_directory() {
    local directory="$1"
    if [[ -d "$directory" ]]; then
        if [[ -d "$directory/.git" ]]; then
            # Repository root: report it and do not look any deeper
            echo -e "repo: $directory"
            github-linguist "$directory/"
            echo -e "\n"
        else
            # Not a repository: keep searching in the subdirectories
            for subdirectory in "$directory"/*; do
                recurse_directory "$subdirectory"
            done
        fi
    fi
}

directory="$1"
if [[ -z "$directory" ]]; then
    echo "Usage: $0 <directory>"
    exit 1
fi
recurse_directory "$directory"
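
The same "stop at the first repository root" rule can also be expressed with `find` and `-prune`, which may be handy for simply listing the projects before analyzing them. This is a sketch of an alternative, not the script above: `list_project_roots` is a hypothetical name and the demo tree is invented.

```shell
#!/usr/bin/env bash
# list_project_roots: hypothetical helper.
# Prints every directory that contains a .git directory and does not
# descend into it, so nested repositories (dependencies) are skipped.
list_project_roots() {
    find "$1" -type d \
        -exec sh -c 'test -d "$1/.git"' _ {} \; \
        -print -prune
}

# Demo on a throwaway tree: "dep" is a repository nested inside "app"
demo=$(mktemp -d)
mkdir -p "$demo/app/.git" "$demo/app/vendor/dep/.git" "$demo/group/lib/.git"
roots=$(list_project_roots "$demo" | sed "s|^$demo/||" | sort)
echo "$roots"
rm -rf "$demo"
```

On this tree the function lists `app` and `group/lib` but not `app/vendor/dep`, because `-prune` stops the descent as soon as a repository root is printed.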

Let’s have fun with some statistics

Once we have our report for all our projects, we can use some tools to get some statistics.

First, let’s transform the output of the script into a CSV file using awk:

  1. Counting the number of projects using a programming language:
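#!/bin/awk -f
# linguist-count.awk
BEGIN { OFS = ";" }

# Language lines look like "65.80%  945252     Jupyter Notebook"
/^[0-9]/ {
    language = $0
    sub(/^[0-9.]+%[ \t]+[0-9]+[ \t]+/, "", language)  # keep only the name
    languages[language]++
}

END {
    for (language in languages) {
        print language, languages[language]
    }
}

$ ./ ~/Documents/Projects/ | ./linguist-count.awk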
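
To sanity-check the counting step without scanning real repositories, a small hand-written report can be piped through the same kind of awk program. The sketch below inlines the awk code in a hypothetical `count_languages` shell function. The two-project report is invented sample data in the layout shown earlier, and the output is sorted only to make it deterministic, since the order of awk’s `for (key in array)` is unspecified.

```shell
#!/usr/bin/env bash
# count_languages: hypothetical wrapper around an inline awk program.
# Reads linguist reports on stdin, prints "language;number_of_projects".
count_languages() {
    awk 'BEGIN { OFS = ";" }
    /^[0-9]/ {
        name = $0
        sub(/^[0-9.]+%[ \t]+[0-9]+[ \t]+/, "", name)  # keep only the language name
        count[name]++
    }
    END { for (name in count) print name, count[name] }'
}

# Invented sample: two projects, both using Python
sample='repo: /tmp/project-a
80.00%  800        Python
20.00%  200        Shell

repo: /tmp/project-b
100.00% 500        Python'

csv=$(printf '%s\n' "$sample" | count_languages | sort -t ';' -k2,2nr -k1,1)
echo "$csv"
```

With this sample the result is `Python;2` followed by `Shell;1`: each language is counted once per project in which it appears, regardless of its share.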

Now that we have our first data, let’s use R to plot it:

#!/usr/bin/env Rscript
# linguist-count.R
# Usage: ./linguist-count.R <csv>
# Plot the number of projects using a programming language.

library(ggplot2)

# Parse the command line arguments
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1) {
    stop("Usage: ./linguist-count.R <csv>")
}
csv <- args[1]

languages_count_df <- read.csv(file = csv, header = FALSE, sep = ";")
colnames(languages_count_df) <- c("language", "count")

ggplot(data = languages_count_df, aes(x = reorder(language, count), y = count)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(x = "Programming language", y = "Number of projects")

ggsave("linguist-project-count.png", width = 10, height = 5)

Figure: bar plot of the languages I used, according to github-linguist