The blog of a juvenile Geekus biologicus

Analyze a project's programming languages using github-linguist

github-linguist is a Ruby library and command line tool for detecting the programming languages used in a project. It is used by GitHub to detect the language of a project and to generate language statistics.

We can also run it from the command line to analyze the languages used in a local project.

When I applied to a bioinformatics master's degree, I had to say which programming languages I have a command of. So here are some quick tips on using github-linguist, as I learned to do for this purpose.

Installation

Requirements

Install

$ gem install github-linguist
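
After installing, it can be worth checking that the executable is actually reachable, since gem executables are sometimes installed in a directory that is not on the PATH (`gem environment` lists it as the executable directory). A minimal sketch; `have_command` is a hypothetical helper name, not something shipped with github-linguist:

```shell
# Hypothetical helper: succeed only if the given command is found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command github-linguist; then
    echo "github-linguist is installed"
else
    echo "github-linguist not found: check that the gem executable directory is on your PATH"
fi
```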

Usage

$ github-linguist

For instance, on my blog's source code I get:

65.80%  945252     Jupyter Notebook
17.12%  245876     CSS
14.37%  206405     HTML
1.29%   18478      Less
0.77%   11019      Python
0.43%   6212       Shell
0.17%   2472       Makefile
0.06%   879        JavaScript
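
Each line of this report holds three columns: a percentage, a size (presumably in bytes) and the language name. As a minimal sketch of how such a report can be post-processed, the function below keeps only the languages above a given share. `languages_above` is a hypothetical name, and the sample lines are copied from the report above:

```shell
# Hypothetical helper: print the languages whose share is at least a given
# percentage, reading a github-linguist report on standard input.
languages_above() {
    # $1: minimum percentage, for instance 1
    awk -v min="$1" '
        /^[0-9]/ {
            percent = $1 + 0          # "65.80%" is converted to 65.8
            if (percent < min) next
            name = $3                 # language names may contain spaces
            for (i = 4; i <= NF; i++) name = name " " $i
            print name
        }'
}

# Sample lines taken from the report above
printf '%s\n' \
    '65.80%  945252     Jupyter Notebook' \
    '1.29%   18478      Less' \
    '0.77%   11019      Python' | languages_above 1
```

With a threshold of 1, this prints Jupyter Notebook and Less, and drops Python.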

Let’s use a script to get the results for all our projects at once. Assume they all live in one directory, say ~/Documents/Projects/:

#!/bin/bash
# linguist.sh
# Usage: ./linguist.sh ~/Documents/Projects/
#
# Recursively search for git repositories in the given directory
# and print the programming languages used in each of them.

# Get the directory to search for git repositories
if [ -z "$1" ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi
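DIR=$1
# Read NUL-separated paths so that directory names containing spaces survive
find "$DIR" -name .git -type d -print0 | while IFS= read -r -d '' GIT_DIR; do
    REPO=$(dirname "$GIT_DIR")  # the working tree that contains .git
    echo "repo: $REPO"
    github-linguist "$REPO"
    echo -e "\n"
done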
$ ./linguist.sh ~/Documents/Projects/

This has the disadvantage of printing a report for every repository found, including dependencies that are themselves git repositories nested inside a project.

Let’s assume that a project is a git repository root, and that the other git repositories in subdirectories are dependencies:

#!/bin/bash
# linguist.sh
# Usage: ./linguist.sh ~/Documents/Projects/
#
# Recursively search for git repositories in the given directory,
# stop at the first repository root found on each path (nested
# repositories are treated as dependencies) and print its languages.

recurse_directory() {
    local directory
    directory="$1"
    if [[ -d "$directory" ]]; then
        if [[ -d "$directory/.git" ]]; then
            echo -e "repo: $directory"
            github-linguist "$directory"
            echo -e "\n"
        else
            for subdirectory in "$directory"/*; do
                recurse_directory "$subdirectory"
            done
        fi
    fi
}

# "local" is only valid inside a function, so use a plain assignment here
directory="$1"
if [[ -z "$directory" ]]; then
    echo "Usage: $0 <directory>"
    exit 1
fi
recurse_directory "$directory"
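
The same "stop at the first repository root" rule can also be sketched with find and -prune instead of a recursive function. This is only an alternative under the assumption stated above, and `list_project_roots` is a hypothetical name. Unlike the `"$directory"/*` glob, find also looks inside hidden directories:

```shell
# Hypothetical helper: list the top-level git repositories below a directory.
# A directory containing .git is printed and then pruned, so repositories
# nested inside it (dependencies) are never visited.
list_project_roots() {
    find "$1" -type d -exec sh -c 'test -d "$1/.git"' sh {} \; -print -prune
}

# Possible usage (requires github-linguist):
# list_project_roots ~/Documents/Projects/ | while IFS= read -r repo; do
#     echo "repo: $repo"
#     github-linguist "$repo"
# done
```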

Let’s have fun with some statistics

Once we have the report for all our projects, we can use a few tools to get some statistics out of it.

First, let’s transform the output of linguist.sh into a CSV file using awk:

  1. Counting the number of projects that use each programming language:
#!/usr/bin/awk -f
# linguist-count.awk
BEGIN {
    OFS = ";"
}

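/^[0-9]/ {
    # The language name starts at field 3 and may contain spaces
    # (e.g. "Jupyter Notebook"), so rebuild it from the remaining fields
    language = $3
    for (i = 4; i <= NF; i++) language = language " " $i
    languages[language]++
}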

END {
    for (language in languages) {
        print language, languages[language]
    }
}
$ ./linguist.sh ~/Documents/Projects/ | ./linguist-count.awk > linguist-count.csv
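
The same report can feed other statistics. As a sketch of one possibility (not something from my original scripts), the function below sums the size column per language over all projects and prints it in the same "language;value" format. `sum_sizes` is a hypothetical name:

```shell
# Hypothetical helper: sum the second column (size) per language over the
# whole report read on standard input, and print "language;total" lines.
sum_sizes() {
    awk '
        BEGIN { OFS = ";" }
        /^[0-9]/ {
            name = $3                 # rebuild names such as "Jupyter Notebook"
            for (i = 4; i <= NF; i++) name = name " " $i
            total[name] += $2
        }
        END {
            for (name in total) print name, total[name]
        }'
}

# Possible usage:
# ./linguist.sh ~/Documents/Projects/ | sum_sizes
```

The order of the output lines is not defined, so pipe the result through sort if a stable order matters.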

Now that we have our first data set, let’s use R to plot it:

#!/usr/bin/env Rscript
# linguist-count.R
# Usage: ./linguist-count.R <csv>
#
# Plot the number of projects using a programming language.

library(ggplot2)

# Parse the command line arguments

args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1) {
    stop("Usage: ./linguist-count.R <csv>")
}
csv <- args[1]

languages_count_df <- read.csv(file = csv, header = FALSE, sep = ";")
colnames(languages_count_df) <- c("language", "count")

ggplot(data = languages_count_df, aes(x = reorder(language, count), y = count)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(x = "Programming language", y = "Number of projects")

ggsave("linguist-project-count.png", width = 10, height = 5)

[Figure: bar plot of the languages I used, according to github-linguist]