Analyze a project's programming languages using github-linguist
github-linguist is a Ruby library and command line tool for detecting the programming languages used in a project. It is used by GitHub to detect the language of a project and to generate language statistics.
We can use it from the command line to analyze the programming languages used in a project.
When I applied to a bioinformatics master's degree, I had to state which programming languages I have a command of. So here are some quick tips for using github-linguist, as I learned to do for this purpose.
Installation
Requirements
- Ruby, gem and ruby-devel packages
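Before installing, it can help to check that the required commands are actually on the PATH. The following is a minimal sketch, not part of github-linguist: `missing_commands` is a hypothetical helper name, and it can only check commands such as `ruby` and `gem`. The ruby-devel package provides header files rather than a command, so it has to be checked with your package manager instead.

```shell
#!/bin/bash
# missing_commands: hypothetical helper (an assumption, not a linguist tool).
# Prints each of the given command names that cannot be found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: report what is still missing before installing the gem
# missing_commands ruby gem
```

An empty result means every listed command was found.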
Install
$ gem install github-linguist
Usage
$ github-linguist
For instance, on my blog's source code I get:
65.80% 945252 Jupyter Notebook
17.12% 245876 CSS
14.37% 206405 HTML
1.29% 18478 Less
0.77% 11019 Python
0.43% 6212 Shell
0.17% 2472 Makefile
0.06% 879 JavaScript
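Each line of this report has three columns: a percentage, a size (apparently in bytes), and the language name, which may contain spaces. As a minimal sketch of how the report could be post-processed, the hypothetical helper `top_language` below reads such a report on standard input and prints the language with the largest share. The function name is an assumption made for illustration.

```shell
#!/bin/bash
# top_language: hypothetical helper, not part of github-linguist.
# Reads a report (percentage, size, language name) on stdin and
# prints the name of the language with the highest percentage.
top_language() {
    awk '
        /^[0-9]/ {
            share = $1 + 0                 # "65.80%" is read as 65.8
            name = $3
            # Language names such as "Jupyter Notebook" span several fields
            for (i = 4; i <= NF; i++) name = name " " $i
            if (share > best) { best = share; top = name }
        }
        END { if (top != "") print top }
    '
}

# Example: github-linguist | top_language
```

On the report above, this would print Jupyter Notebook.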
Let’s use a script to get the result for all my projects.
Let’s assume you have a directory with all your projects, say ~/Documents/Projects/:
#!/bin/bash
# linguist.sh
# Usage: ./linguist.sh ~/Documents/Projects/
#
# Recursively search for git repositories in the given directory
# and print the programming languages used in each of them.

# Get the directory to search for git repositories
if [ -z "$1" ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi
DIR=$1

# Read NUL-separated paths so that names containing spaces are handled safely
find "$DIR" -name .git -type d -print0 | while IFS= read -r -d '' GIT_DIR; do
    # find returns the .git directory; the repository is its parent
    REPO=$(dirname "$GIT_DIR")
    echo -e "repo: $REPO"
    github-linguist "$REPO/"
    echo -e "\n"
done
$ ./linguist.sh ~/Documents/Projects/
This has the disadvantage of printing a result for every repository, including bundled dependencies.
Let’s assume that a project is a git repository root, and that the other git repositories in subdirectories are dependencies:
#!/bin/bash
# linguist.sh
# Usage: ./linguist.sh ~/Documents/Projects/
#
# Recursively search for git repositories in the given directory
# and print the programming languages used in each of them.
recurse_directory() {
    local directory subdirectory
    directory="$1"
    if [[ -d "$directory" ]]; then
        if [[ -d "$directory/.git" ]]; then
            # A repository root: report it and do not descend any further
            echo -e "repo: $directory"
            github-linguist "$directory/"
            echo -e "\n"
        else
            for subdirectory in "$directory"/*; do
                recurse_directory "$subdirectory"
            done
        fi
    fi
}

# "local" is only valid inside a function, so use a plain variable here
directory="$1"
if [[ -z "$directory" ]]; then
    echo "Usage: $0 <directory>"
    exit 1
fi
recurse_directory "$directory"
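The same "stop at the first repository root" rule can also be sketched with find alone, using -prune to avoid descending into a directory once it is known to contain .git. This is an alternative to the recursive function above, not the method used in this post, and `list_project_roots` is a hypothetical name. Unlike the `*` glob, find also visits hidden directories.

```shell
#!/bin/bash
# list_project_roots: hypothetical helper (an assumption for illustration).
# Prints every directory under "$1" that contains a .git directory,
# without looking inside those directories for nested repositories.
list_project_roots() {
    find "$1" -type d \
        -exec sh -c 'test -d "$1/.git"' sh {} \; \
        -print -prune
}

# Example: run linguist on each project root
# list_project_roots ~/Documents/Projects/ | while IFS= read -r repo; do
#     echo "repo: $repo"
#     github-linguist "$repo/"
# done
```

The line-based loop in the example assumes that directory names contain no newline characters.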
Let’s have fun with some statistics
Once we have our report for all our projects, we can use some tools to get some statistics.
First, let’s transform the output of linguist.sh into a CSV file using awk:
- Counting the number of projects using a programming language:
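#!/bin/awk -f
# linguist-count.awk
BEGIN {
    OFS = ";"
}
# Report lines start with a percentage, e.g. "65.80% 945252 Jupyter Notebook"
/^[0-9]/ {
    # The language name may contain spaces, so join fields 3 to NF
    language = $3
    for (i = 4; i <= NF; i++) {
        language = language " " $i
    }
    languages[language]++
}
END {
    for (language in languages) {
        print language, languages[language]
    }
}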
$ ./linguist.sh ~/Documents/Projects/ | ./linguist-count.awk
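To make the shape of the resulting CSV concrete, here is a small sketch that wraps the same counting idea in a shell function and sorts the rows by descending count. The function name `count_languages` and the sorting step are assumptions added for illustration. The awk order of `for (name in counts)` is unspecified, so sorting makes the output reproducible.

```shell
#!/bin/bash
# count_languages: hypothetical wrapper around the counting logic.
# Reads the combined report on stdin and prints "language;count" rows,
# most frequently used language first (ties sorted by name).
count_languages() {
    awk '
        BEGIN { OFS = ";" }
        /^[0-9]/ {
            name = $3
            for (i = 4; i <= NF; i++) name = name " " $i
            counts[name]++
        }
        END { for (name in counts) print name, counts[name] }
    ' | sort -t ';' -k2,2nr -k1,1
}

# Example: ./linguist.sh ~/Documents/Projects/ | count_languages
```

For a report covering two repositories, one using Python and Jupyter Notebook and the other only Python, the rows would be `Python;2` followed by `Jupyter Notebook;1`.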
Now that we have our first data, let’s save that output to a CSV file and use R to plot it:
#!/usr/bin/Rscript
# linguist-count.R
# Usage: ./linguist-count.R <csv>
#
# Plot the number of projects using a programming language.
library(ggplot2)
# Parse the command line arguments
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1) {
  stop("Usage: ./linguist-count.R <csv>")
}
csv <- args[1]
languages_count_df <- read.csv(file = csv, header = FALSE, sep = ";")
colnames(languages_count_df) <- c("language", "count")
ggplot(data = languages_count_df, aes(x = reorder(language, count), y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = "Programming language", y = "Number of projects")
ggsave("linguist-project-count.png", width = 10, height = 5)