# Reproducing "A Large-Scale Study of Programming Languages and Code Quality in GitHub: A Reproducibility Study"

This is an attempt to reproduce the data processing and analysis from this paper, from its associated Github repo, itself a reproducibility study on an earlier paper. The initial method used to create this notebook is to process all of the `.Rmd`

(R markdown) files in the repo's root directory into one file (done here), and then import that file into Nextjournal. The result will have data and library codes filled in, formatting issues corrected, and any required code changes made to get processing running and results displayed will be annotated.

## 1. Building the Artifact of the original paper (original_artifact.Rmd)

The following command extract the contents of the original artifact into the `original`

directory alongside the archive. The original artifact contains two folders, the `R_code`

folder in which the original R scripts can be found as they were given to us and the `sqlDump`

folder, which contains the data.

cd original tar -zxf artifact.tgz cd ..

After the original artifact is extracted, the sql data dump must be combined together to produce the two files the rest of the paper is concerned with, namely the `_everything.csv`

file containing th entire dataset and th `_newSha.csv`

containing filtered data. The original artifact contained the `rebuild.sh`

script to produce these files, which can be found in the `original/sqlDump/DATA`

directory:

cd original/sqlDump ./rebuild.sh cd ../..

## 2. A Large Scale Study of Programming Languages and Code Quality in Github --- Repetition (repetition.Rmd)

Initially we load the implementation and initialize the environment making sure that all outputs of this script will be stored in `artifact/repetition`

subfolder:

# load the file containing the actual implementation details knitr::opts_chunk$set(echo = FALSE) source("implementation.R") initializeEnvironment("./artifact/repetition")

### 2.1. Data collection (chapter 2.2 of the FSE manuscript)

We staert with data acquisition. The artifact obtained from the original authors contained two CSV files. The file `everything`

contains one row per unique commit and filename pair. The file `newSha`

contains one row per unique commit and language pair. Thus, one row of `newSha`

summarizes one or more rows of `everything`

. By looking at the code of the artifact we have determined that the contents of the `newSha`

file was used for the analysis.

everything <- loadEverything() newSha <- loadNewSha()

#### 2.1.1. Number of projects analyzed

The study claims that `729`

projects were used and that each one of them had more than 27 commits in any language used:

newSha_project_names = sort(unique(newSha$project)) number_of_projects <- length(newSha_project_names) check(number_of_projects==729) projects_with_too_few_commits = newSha %>% group_by(project, language) %>% dplyr::summarize(n = n()) %>% filter(n <= 27) check(nrow(projects_with_too_few_commits) == 0)

Let's see how many projects are there in `everything`

:

# There are projects in the larger file that are missing from the smaller file, with no explanation in the paper everything_project_names <- sort(unique(everything$project)) assert_that(length(everything_project_names)==877) # some projects are not used in the paper projects_not_in_paper <- setdiff(everything_project_names, newSha_project_names) number_of_projects_not_included <- length(projects_not_in_paper) check(number_of_projects_not_included == 148) # some commits are not used everything %>% filter(project %in% projects_not_in_paper) -> everything_unused # and the rest is used everything %>% filter( project %in% newSha_project_names) -> everything_used assert_that(nrow(everything_used)==4956282)

Ok, that's more, check if the discrepancy is explained by too few commits for the missing projects:

number_of_large_projects_not_included <- nrow(everything %>% dplyr::filter(project %in% projects_not_in_paper) %>% group_by(project,tag) %>% dplyr::summarize (n=n()) %>% filter(n>27)) check(number_of_large_projects_not_included == 101) out("numberOfProjectsIncluded", number_of_projects ) out("numberOfProjectsNotIncluded", number_of_projects_not_included) out("numberOfLargeProjectsNotIncluded", number_of_large_projects_not_included)

We found data for `148`

projects that have been ignored in the study. Although the study filtered out projects with fewer than 28 commits, 101 of the ignored projects have 28 commits or more.

#### 2.1.2. Sizes of the code analyzed

The study claims that `63M`

SLOC were analyzed and uses the difference betweeen insertions and deletions as reported by git to calculate this. Although this metric is not precise, we follow with it:

sloc <- sum(newSha$insertion) - sum(newSha$deletion) sloc_mio <- round(sloc / 1000000, 1) check(sloc_mio == 63) # what the paper claims out("slocMio", sloc_mio) check(sloc_mio == 80.7)

In fact, there are There are 80.7 mio lines of code

#### 2.1.3. Claim: `29K authors`

`29K authors`

number_authors <- length(unique(everything_used$author)) check(number_authors / 1000 == 29) # what the paper says check(number_authors == 47297) number_committers <- length(unique(everything_used$committer)) check(number_committers == 29082) out("numberCommitters", round(number_committers/1000, 0)) out("numberAuthors", round(number_authors/1000, 0)) everything %>% filter(committer == "Linus Torvalds") -> lt everything %>% filter(author == "Linus Torvalds") -> lt2 out("linusCommitter", nrow(lt)) out("linusAuthor", nrow(lt2))

There are `29K`

committers, but the study is interested in active developers and not just people with commit rights. There are `47K`

developers.

#### 2.1.4. Claim: `1.5M commits`

`1.5M commits`

number_of_commits <- length(unique(newSha$sha)) check(number_of_commits == 1478609) out("numberCommitsMio", round(number_of_commits/1000/1000,1))

There are 1.5m commits.

#### 2.1.5. Claim: `17 languages`

`17 languages`

number_of_languages = length(unique(newSha$language)) check(number_of_languages == 17)

The authors have indeed operated with `17`

languages.

#### 2.1.6. Project-Language pairs with fewer than 20 commits are ignored

The authors claim that they have excluded project-language pairs for which there was fewer than 20 commits in project. Let's see if there are any such pairs in `newSha`

:

newSha %>% group_by(project, language) %>% dplyr::summarize(n=n()) %>% filter(n < 20) -> too_small_to_keep check(nrow(too_small_to_keep) == 0)

There are none, so the data has already been cleaned.

#### 2.1.7. Claim: 564,625 buggy commits.

newSha %>% filter(isbug == 1) -> bugFixes numberOfBugFixes <- length(unique(bugFixes$sha)) check(numberOfBugFixes == 530777) out("numberOfBugFixes", numberOfBugFixes)

The number of bug fixes is close but not quite matching. We have about `30K`

fewer.

#### 2.1.8. `everything`

to `newSha`

summary

`everything`

to `newSha`

summaryLet's now summarize `everything`

which contains only the projects used in `newSha`

to check for any further discrepancies wrt `newSha`

. First we only select the columns we will need and convert discrete data into factors to limit memory footprint and speedup execution:

# The data frames have redundancy. Delete columns we can derive and remove columns we don't use. Turn character columns into factors. everything %>% dplyr::select(tag, # language assigned by authors author, # name of author committer, # name of committer (not same as author) commit_date, # when was the commit made commit_age, # ??? insertion, # number of lines inserted deletion, # number of lines deleted sha, # commit id project, # project name file_name, # commit filename domain, isbug ) -> everything_clean # turn discrete data into factors everything_clean %>% mutate(language = as.factor(tag), committer = as.factor(committer), project = as.factor(project) ) -> everything_clean

Next, there are some redundancies in `everything`

, such as rows that differ in `github_languag`

column, but describe the same file (we found `23210`

such rows). Since we are not interested in the `github_language`

column any more, we can simply ignore such redundancies:

# We remove keep only one row per unique columns we are interested in everything_clean %>% distinct(sha,file_name, project,.keep_all=TRUE) -> everything_clean

To ease our life down the road, we also rename the languages factor to use same names as does the `newSha`

version, and order the factor in alphabetical order for better visualization:

# sanity: use the same labels as in newSha everything_clean$language <- revalue(everything_clean$language, c( "c"="C", "clojure"="Clojure", "coffeescript"="Coffeescript", "cpp"="C++","csharp"="C#", "erlang"="Erlang", "go"="Go", "haskell"="Haskell", "java"="Java", "javascript"="Javascript", "objc"="Objective-C", "perl"="Perl", "php"="Php", "python"="Python", "ruby"="Ruby", "scala"="Scala", "typescript"="Typescript")) # sanity: levels in alphabetical order everything_clean$language <- factor(everything_clean$language, levels=sort(levels(everything_clean$language)))

Finally, let's summarize `everything_clean`

per language:

checkSha = everything_clean %>% dplyr::select(-(file_name)) %>% group_by(sha) %>% mutate(files = n()) %>% distinct(sha, .keep_all = T) %>% dplyr::select(language, project, sha, files, committer, author, commit_age, insertion, deletion, isbug, domain)

Let's see the number of unique commits found in `newSha`

and the version constructed from `everything`

:

newSha_distinct = unique(newSha$sha) checkSha_distinct = unique(checkSha$sha) in_both = intersect(newSha_distinct, checkSha_distinct) out("notInNewShaRaw", length(checkSha_distinct) - length(in_both)) out("notInEverythingRaw", length(newSha_distinct) - length(in_both))

This is promising, what we can do now is basically use `author`

field from checked and append it to `newSha`

which lacks the author (the original paper used committers instead as we discovered above).

stuff = checkSha %>% dplyr::select(sha, author) newSha = inner_join(newSha, stuff, "sha")

Now in order to properly compare `newSha`

and its version we obtained from `everything`

we should remove from it any project-language pairs that have fewer than 20 commits:

checkSha = checkSha %>% group_by(project, language) %>% mutate(n = n()) %>% filter(n >= 20) %>% dplyr::select(-(n));

And redo the verification that it contains everything `newSha`

does:

newSha_distinct = unique(newSha$sha) checkSha_distinct = unique(checkSha$sha) in_both = intersect(newSha_distinct, checkSha_distinct) out("notInNewSha", length(checkSha_distinct) - length(in_both)) out("notInEverything", length(newSha_distinct) - length(in_both))

OUCH. So there are now commits that are in `newSha`

, but are *not* in `everything`

summarized. Furthermore, we should be comparing based on projects + commits since a commit may be shared with more projects:

newSha_distinct = unique(paste0(newSha$project, newSha$sha)) checkSha_distinct = unique(paste0(checkSha$project, checkSha$sha)) in_both = intersect(newSha_distinct, checkSha_distinct) out("notInNewShaWithProject", length(checkSha_distinct) - length(in_both)) out("notInEverythingWithProject", length(newSha_distinct) - length(in_both))

OUCH^2. It would seem there is *lot* of duplication in `newSha`

that is not present in `everything`

. Spoiler alert: More on duplication in `newSha`

in the reproduction section.

As last thing in the data collection section, let us give `newSha`

the same treatment we gave to `everything`

, i.e. add factors, remove unused columns, etc.

# newSha also has redundancies newSha %>% dplyr::select(language, typeclass, langclass, memoryclass, compileclass, project, sha, files, committer, author, commit_age, commit_date, insertion, deletion, isbug, #bug_type, domain, btype1, btype2 ) %>% mutate(language = as.factor(language), typeclass = as.factor(typeclass), langclass = as.factor(langclass), memoryclass = as.factor(memoryclass), compileclass = as.factor(compileclass), project = as.factor(project), #committer = as.factor(committer), #bug_type = as.factor(bug_type), domain = as.factor(domain) ) -> newSha # sanity: levels in alphabetical order newSha$language <- factor(newSha$language, levels=sort(levels(newSha$language)))

### 2.2. Categorizing Languages (section 2.3)

While we did not have the code that categorizes the languages into groups, the classes are well described in the paper and the classification data is available in the `newSha`

file. As a first step we thus verify that the categorization in `newSha`

conforms to the categories explained in the paper. First, we bring the class format to single column compatible with what we use further down the study.

We reclassify `hybrid`

memory model languages (Objective-C) to `Unmanaged`

, which is what the paper later suggests (the claim is that while Objective-C can use both, unmanaged is the default case).

newSha$typeclass = revalue(newSha$typeclass, c("strong" = "Str", "weak" = "Wea")) newSha$langclass = revalue(newSha$langclass, c("functional" = "Fun", "proc" = "Pro", "script" = "Scr")) newSha$memoryclass = revalue(newSha$memoryclass, c("unmanaged" = "Unm", "managed" = "Man", "hybrid" = "Unm")) newSha$compileclass = revalue(newSha$compileclass, c("dynamic" = "Dyn", "static" = "Sta")) newSha$combinedOriginal = paste(newSha$langclass, newSha$compileclass, newSha$typeclass, newSha$memoryclass)

Now, classify the languages according to the classification as described in the FSE paper:

newSha$combined = classifyLanguageAsOriginal(newSha$language)

Make sure that the original and our recreated classification is identical:

check(identical(newSha$combinedOriginal, newSha$combined))

So no errors in this section. Let's remove `combine`

column:

newSha = newSha %>% dplyr::select(-(combined))

Next, the paper actually ignores TypeScript from the language category analysis saying that TypeScript does not properly fit into the categories, so we assign a special category (`Other`

) to it so that we can easily remove it later. Finally, we convert the `combinedOriginal`

column into a factor.

newSha$combinedOriginal[newSha$language == "Typescript"] = "Other" newSha = newSha %>% mutate(combinedOriginal = as.factor(combinedOriginal))

#### 2.2.1. Better Language Classification

Although the language categorization in the data is same as in the paper (modulo the hybrid memory class for Objective-C and categories given for Typescript), it is wrong. We have both reclassified the obvious mistakes and marked two other languages, that do not properly fit the categories (Scala and Objective-C). Furthermore we categorize Typescript as scripting, dynamic, weak and managed, since it fully conforms to this category:

Category | Per Paper | Per Reality ----------------------------------|--------------------------------|------------- Functional-Static-Strong-Managed | haskell scala | haskell

Functional-Dynamic-Strong-Managed | clojure erlang | (empty) Proc-Static-Strong-Managed | java go cs | java go c# Proc-Static-Weak-UnManaged | c cpp | C C++ Script-Dynamic-Weak-Managed | coffee js perl php | Python Perl Ruby JavaScript Php CoffeeScript Typescript Script-Dynamic-Strong-Managed | python ruby | (empty) Functional-dynamic-weak-managed | (empty) | clojure erlang Other | Typescript | Scala Objective-C

newSha$combinedCorrected = as.factor(classifyLanguageCorrected(newSha$language))

### 2.3. Identifying Project Domain (2.4)

The artifact available to us does not contain the code for the identification and the paper does not specify it in detail enough for us to repeat it from scratch. The data however contain the `domain`

column which corresponds to the domain associated in this section. We will use this in subsequent analysis.

### 2.4. Categorizing Bugs (2.5)

The artifact again did not contain any code towards this goal, nor did we find the results of such categorization anywhere in the data. We will revisit this in RQ4.

### 2.5. Saving the processed data

This is not actual part of the repetition, but we save the `newSha`

and `checkSha`

for future use:

write.csv(newSha, paste0(WORKING_DIR, "/Data/newSha.csv")) write.csv(checkSha, paste0(WORKING_DIR, "/Data/checkSha.csv"))

### 2.6. Are some languages more defect prone than others? (chapter 3, RQ 1)

Get the data, summarize the commits information per project & language and log transform according to the paper (in the code we have obtained log10 and log are used for different rows of the model)

newSha$combined = newSha$combinedOriginal newSha$devs = newSha$committer X = summarizeByLanguage(newSha) Y = logTransform(X, log10, log)

#### 2.6.1. Fit the Negative Binomial Regression

The Negative Binomial model is fit on the original scale of bcommits. However, the predictors were log-transformed to avoid influential observations (as was in the study). First verify the weighted contrasts to conform to the original:

contr.Weights(Y$language)

We see diagonal with ones and the last line (Typescript) contains the weighted contrasts, this looks good.

Let's fit the model, and then the releveled model to get the parameter for the last language:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = contr.Weights(Y$language)), data=Y) nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = contr.Weights(Y$language_r)), data=Y) # combine them into single result table result = combineModels(nbfit, nbfit_r, Y$language) result

Let's now juxtapose this to the FSE paper's results. The `ok`

column contains `TRUE`

if there was a claim in the original with a significance and we support that claim at the same significance level, `FALSE`

if there was a claim and we do not support it, or support it at worse significance, and `NA`

if there was no claim in the original to begin with.

juxt = merge(result, baselineFSE_RQ1(), by = 0, all = T, sort = F) juxt$ok = checkPValues(juxt, "FSE_pv", "pVal") juxt

Outside of the control variables, the only value we were not able to repeat is PHP. We believe this is a simple typo in the original paper (the authors themselves have noticed this in the CACM reprint of the paper), so we have created updated baseline:

juxt = merge(result, baselineFSE_RQ1_fixed(), by = 0, all = T, sort = F) juxt$ok = checkPValues(juxt, "FSE_pv", "pVal") juxt

Better. We were able to repeat the RQ1 convincingly.

#### 2.6.2. Deviance Analysis

The authors also display numbers for deviance analysis. They find that log commits is the single most contributing factor, with second most important being the languages at less than 1%:

anova(nbfit)

Well, this looks different indeed. This is because we have different order of the control varibles and `anova`

checks type-I tests so order matters. If we change the order to correspond to the original manuscript, we can replicate:

nbfit_order_fixed = glm.nb(bcommits~lcommits+lmax_commit_age+ltins+ldevs+language, contrasts = list(language = contr.Weights(Y$language)), data=Y) anova(nbfit_order_fixed)

However, proper way would be to use `Anova`

function which is order independent as it does type-II tests:

Anova(nbfit)

### 2.7. Which Language Properties Relate to Defects (chapter 3, RQ 2)

First, let's drop Typescript, which now belongs to the `Other`

category since we are not interested in it:

newSha$combined = newSha$combinedOriginal newSha$devs = newSha$committer X = summarizeByLanguage(newSha) Y = logTransform(X, log10, log) Y = Y %>% filter(combined != "Other") Y = droplevels(Y)

Let's check the weights first, as we did in RQ1:

contr.Weights(Y$combined)

Fit the model, similarly to RQ1:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined, contrasts = list(combined = contr.Weights(Y$combined)), data=Y) nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined_r, contrasts = list(combined_r = contr.Weights(Y$combined_r)), data=Y) # combine them into single result table result = combineModels(nbfit, nbfit_r, Y$combined) result

And let's juxtapose this against the original FSE paper:

juxt = merge(result, baselineFSE_RQ2(), by = 0, all = T, sort = F) juxt$ok = checkPValues(juxt, "FSE_pv", "pVal") juxt

Using the original classification, we can repeat the RQ2 results properly. The only difference is the large coefficient we see in commits. This is because apparently this time the FSE paper used natural logarithms everywhere, so we try that too:

Now let's look if this changes when we use the updated classification:

newSha$combined = newSha$combinedCorrected Xcor = summarizeByLanguage(newSha) Ycor = logTransform(Xcor, log, log) Ycor = Ycor %>% filter(combined != "Other") Ycor = droplevels(Ycor) nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined, contrasts = list(combined = contr.Weights(Ycor$combined)), data=Ycor) nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined_r, contrasts = list(combined_r = contr.Weights(Ycor$combined_r)), data=Ycor) # combine them into single result table resultReclassified = combineModels(nbfit, nbfit_r, Ycor$combined) juxt = merge(resultReclassified, baselineFSE_RQ2(), by = 0, all = T, sort = F) juxt$ok = checkPValues(juxt, "FSE_pv", "pVal") juxt

We can observe that the log commits coefficient now corresponds to the original value due to the log change. Furthermore we see that the `Pro Sta Str Man`

and `Scr Dyn Wea Man`

categories are no longer significant and we cannot say anything about the `Fun Dyn Str Man`

category since there are no such languages (but these almost exclusively went to `Fun Dyn Wea Man`

, which is signifcant under the reclassification).

To summarize, we were able to replicate RQ2 properly. However, if we correct the errors in language classification into categories, the claims are much weaker - notably procedural managed languages and scripting languages are no longer significantly more likely to contain bugs.

Finally, create the table we will use in the paper:

output_RQ2_table(result, resultReclassified)

### 2.8. Does Language Defect Proneness Depend on Domain (chapter 3, RQ3)

The artifact we obtained contained no code to answer this question, so will attempt to figure the code out ourselves.

Let's try to recreate the heatmaps from RQ3 original paper. First prepare the data:

langs = length(unique(Y$language)) domains = length(unique(Y$domain)) ratio = matrix(0, langs, domains) projects = matrix(0, langs, domains) commits = matrix(0, langs, domains) bcommits = matrix(0, langs, domains) for (i in 1:nrow(Y)) { r = Y[i,] langIndex = as.numeric(r$language) domainIndex = as.numeric(r$domain) projects[langIndex, domainIndex] = projects[langIndex, domainIndex] + 1 commits[langIndex, domainIndex] = commits[langIndex, domainIndex] + r$commits bcommits[langIndex, domainIndex] = bcommits[langIndex, domainIndex] + r$bcommits } ratio = bcommits / commits

Now do the heatmap:

heatmap.2(ratio, dendrogram = "none", cellnote = round(ratio,2), trace = "none", labRow = levels(Y$language), labCol = levels(Y$domain), Rowv = NA, Colv = NA, notecol = "red", col = gray(256:0 / 256 ))

Visual comparison, since this is all we have because no read data on the heatmaps was in the artifact either does confirm that this is very similar to the first heatmap in RQ3 of the original paper. After that the paper argues that outliers needed to be removed, where outliers are projects with bug densities below 10th percentile and above 90th percentile. We therefore remove these and create the second heatmap, first creating the data in the same way as before:

ratioVect = Y$bcommits / Y$commits q10 = quantile(ratioVect, 0.1) q90 = quantile(ratioVect, 0.9) Y__ = Y[(ratioVect > q10) & (ratioVect < q90),] ratio_ = matrix(0, langs, domains) projects_ = matrix(0, langs, domains) commits_ = matrix(0, langs, domains) bcommits_ = matrix(0, langs, domains) for (i in 1:nrow(Y__)) { r = Y__[i,] rr = r$bcommits / r$commits langIndex = as.numeric(r$language) domainIndex = as.numeric(r$domain) projects_[langIndex, domainIndex] = projects[langIndex, domainIndex] + 1 commits_[langIndex, domainIndex] = commits[langIndex, domainIndex] + r$commits bcommits_[langIndex, domainIndex] = bcommits[langIndex, domainIndex] + r$bcommits } ratio_ = bcommits_ / commits_

How much data did we shed?

cat(paste("Records: ", round(100 - nrow(Y__) / nrow(Y) * 100, 2), "%\n")) cat(paste("Buggy commits: ", round(100 - sum(Y__$bcommits) / sum(Y$bcommits) * 100, 2), "%\n")) cat(paste("Commits: ", round(100 - sum(Y__$commits) / sum(Y$commits) * 100, 2), "%\n"))

Ok, that is kind of substantial. But that happens when you go 0.1 and 0.9%, should be ~ 20% of the data.

The heatmap:

heatmap.2(ratio_, dendrogram = "none", cellnote = round(ratio_,2), trace = "none", labRow = levels(Y__$language), labCol = levels(Y__$domain), Rowv = NA, Colv = NA, notecol = "red", col = gray(256:0 / 256 ))

So, this is different to me. In the original paper, C is strongest in Code Analyzer, while we see C being strongest in Database. Also Perl is very strong in Database, but we see no such thing in the paper.

Note: The paper ignores the other category and instead shows the overall results, but this has no effect on the actual domains.

Frustrated by the fact that we cannot replicate the second heatmap, we were thinking if there are other means by which we can support or deny the claims of the original RQ3. This claim is that there is no general relationship between domain and language defect proness. To demonstrate this, we can try the glm from RQ1 and RQ2, but this time see domains. If we do not find significant correlation, then since we have found significant correlation for languages already, it would follow that domains have no general effect.

Let's first summarize the data again:

newSha$combined = newSha$combinedCorrected newSha$devs = newSha$committer X = summarizeByLanguage(newSha) Y = logTransform(X, log, log) # discard the other domain Y = Y %>% filter(domain != "Other") Y = droplevels(Y)

And fit the model, this time showing the domain instead of languages or language categories. However, we do two extra things - we change the contrasts to zero sum contrasts, which makes more sense, and we adjust for multiple hypothesis using the Bonferroni method - see re-analysis.Rmd or the paper for more details on why.

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+domain, contrasts = list(domain = "contr.sum"), data=Y) nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+domain_r, contrasts = list(domain_r = "contr.sum"), data=Y) # combine them into single result table result = combineModels(nbfit, nbfit_r, Y$domain, pValAdjust = "bonferroni") result

Looking at the results above, none of the domains is significant at level `0.01`

so we can say that our tests supports the claim made by the original authors and move to the last RQ. But before doing just that, we should generate a lateX table from our result since this table goes into the paper:

result$pVal = lessThanPvCheck001(round(result$pVal, 2)) for (i in 1:nrow(result)) { out(paste0("rqIIIname",as.roman(i)), rownames(result)[[i]]) out(paste0("rqIIIcoef", as.roman(i)), result$coef[[i]]) out(paste0("rqIIIpv", as.roman(i)), result$pVal[[i]]) }

### 2.9. What is the relation between language and bug category? (chapter 3, RQ4)

There was no column in the dataset with the same bug type labels in Table 5 in the paper. Our closest approximation to the authors' bug categories classifies the commits with Algo, Concurrency, Memory, and Programming in btype1 with those respective labels; it classifies the commits with Security, Performance, and Failure from btype2; and it classifies commits as Other according to whether they are labeled as such in btype1 and btype2.

btype_list <- mapply(c, newSha$btype1, newSha$btype2, SIMPLIFY=F) nb_fun <- function(l) { !("NB" %in% l) } btype_justbugs <- Filter(nb_fun, btype_list) numjustbugs <- length(btype_justbugs) check(numjustbugs == 559186) # Make sure btype1 != btype2 # This will rule out list("Other", "Other") different_btypes <- function(l) { length(unique(l)) == 2 } # rows with one btype listed as "Other" are # not overlapping other_fun <- function(l) { !("Other" %in% l) } overlap <- Filter(other_fun, btype_justbugs) overlap <- Filter(different_btypes, overlap) numoverlap <- length(overlap) check(numoverlap == 7417) percent_overlap <- round((numoverlap / numjustbugs)*100, digits=2)

It is unclear whether the bug categories are meant to be mutually exclusive--but the sum of their percent counts is 104.44%.

Of the 559186 buggy commits in newSha, 1.33% of them have two different bug categories assigned to them, neither of which is "Other".

We attempted to recreate the numbers in Table 5 with the unmodified data in newSha.

T5_bug_type <- c("Algorithm", "Concurrency", "Memory", "Programming", "Security", "Performance", "Failure", "Unknown") bugtype_count <- c(606, 11111, 30437, 495013, 11235, 8651, 21079, 5792) percent_count <- c(0.11, 1.99, 5.44, 88.53, 2.01, 1.55, 3.77, 1.04) # Calculate highest possible value for each category in our data set btype_algo <- nrow(newSha[newSha$btype1 == "Algo",]) btype_conc <- nrow(newSha[newSha$btype1 == "Concurrency",]) btype_mem <- nrow(newSha[newSha$btype1 == "Memory",]) btype_prog <- nrow(newSha[newSha$btype1 == "Programming",]) btype_sec <- nrow(newSha[newSha$btype2 == "Security",]) btype_perf <- nrow(newSha[newSha$btype2 == "Performance",]) btype_fail <- nrow(newSha[newSha$btype2 == "Failure",]) # There is no "FAILURE" category in bug_type btype_unknown <- nrow(newSha[newSha$btype1 == "Other" & newSha$btype2 == "Other",]) our_bugtype_count <- c(btype_algo, btype_conc, btype_mem, btype_prog, btype_sec, btype_perf, btype_fail, btype_unknown) btype_total = sum(newSha$isbug) our_percent = round((our_bugtype_count / btype_total)*100, digits=2) factordiff = mapply(function(X, Y) { round((Y / X), digits=4) }, bugtype_count, our_bugtype_count) countdiff = mapply(function(X, Y) { Y - X }, bugtype_count, our_bugtype_count) bugpercenttotal_known <- sum(percent_count[1:7]) check(bugpercenttotal_known == 103.4) bugpercenttotal <- sum(percent_count) check(bugpercenttotal == 104.44) t5_bug_categories <- data.frame(T5_bug_type, btype_count=bugtype_count, percent=percent_count, our_btype_count=our_bugtype_count, our_percent, diff_factor=factordiff, count_difference=countdiff) t5_bug_categories

These differences in the composition of the data shared with us versus that reported by the authors make the data incomparable.

remove(WORKING_DIR)

### 2.10. Summary

We are mostly able to verify the basic data collection steps, although we have found discrepancies in some numbers reported as well as two-way errors in the summarization of the unique project-commit-file data in `everything`

and `project-commit-language`

in `newSha`

, but overall the data in the artifact seem to be reasonably complete to attempt to repeat the analysis.

We were not able to repeat the project domain and bug category classifications since the artifact contain no code towards these and the description provided in the paper was not detailed enough for us to repeat from scratch.

The paper had 4 claims, of which:

We were able to completely replicate the first claim (language bug proneness) if we assume the typo in one significance level. We have found minor inconsistencies in the paper which do not have effect on the claim.

We were able to replicate the second claim (language categories), however we found mistakes in the language classification. When we corrected for them the original claim was only partially valid.

We were not able to replicate the third claim (language domains) because the data cleaning described in the paper provided not complete description and our data afterward differed significantly from the original. We have however used another method to the same end.

We were not able to replicate the fourth claim (bug categories). Input data to this problem was not available and no detailed description on how to recreate them was available in the artifact or the paper.

## 3. R Notebook (re-analysis.Rmd)

This is our attempt at re-analysis. We only concentrate on RQ1.

First, because this `Rmd`

file can be executed by external script to allow automatic generation of all possible permutations of our re-analysis cleaning steps, we define certain variables that conditionally enable features of this document. By default they are all set to `TRUE`

:

if (! exists("REMOVE_DUPLICATES")) REMOVE_DUPLICATES = T if (! exists("REMOVE_TYPESCRIPT")) REMOVE_TYPESCRIPT = T if (! exists("REMOVE_V8")) REMOVE_V8 = T if (! exists("UNCERTAINTY")) UNCERTAINTY = T if (! exists("USE_AUTHORS_INSTEAD_OF_COMMITTERS")) USE_AUTHORS_INSTEAD_COMMITTERS = T # sanity check prints cat(paste("REMOVE_DUPLICATES: ", REMOVE_DUPLICATES, "\n")) cat(paste("REMOVE_TYPESCRIPT: ", REMOVE_TYPESCRIPT, "\n")) cat(paste("REMOVE_V8: ", REMOVE_V8, "\n")) cat(paste("UNCERTAINTY: ", UNCERTAINTY, "\n")) cat(paste("USE_AUTHORS_INSTEAD_COMMITTERS: ", USE_AUTHORS_INSTEAD_COMMITTERS, "\n"))

Next, we setup working dir. Working dir can also be provided by external script, however if it is not provided, we verify that all features are enabled and set it to the default, i.e. `artifact/re-analysis`

, where the paper expects the graphs to be found.

if (! exists("WORKING_DIR")) { if (! REMOVE_DUPLICATES & REMOVE_TYPESCRIPT & REMOVE_V8 & UNCERTAINTY & USE_AUTHORS_INSTEAD_COMMITTERS) stop("If non-default flags are submitted to the re-analysis notebook, WORKING_DIR must be set.") WORKING_DIR = "./artifact/re-analysis" }

Now initialize the script's environment:

# load the file containing the actual implementation details knitr::opts_chunk$set(echo = FALSE) source("implementation.R") initializeEnvironment()

### 3.1. Data Collection

We reuse the cleaning and optimization passes from the repetition and just load the result of that:

data = read.csv("./artifact/repetition/Data/newSha.csv") initial_data = data initial_number_of_commits = length(unique(data$sha)) initial_number_of_rows = nrow(data) everything =loadEverything()

### 3.2. Data Cleaning

#### 3.2.1. Removing Duplication

First, we remove any duplicates, i.e. same commits that are present in multiple projects. We remove all occurences of commits that have duplicates since we do not know which one to pick to stay and different picks may bias the later analysis.

if (REMOVE_DUPLICATES) { sha_proj = data %>% dplyr::select(sha, project) %>% group_by(sha, project) %>% dplyr::summarize(n = n()) duplicates_sha = sha_proj %>% group_by(sha) %>% dplyr::summarize(n = n()) %>% filter(n > 1) # this is the number of commits that are present in multiple commits num_duplicates_sha = nrow(duplicates_sha) check(num_duplicates_sha == 27450) # how many projects are affected? dup_repos = data %>% filter(sha %in% duplicates_sha$sha) dup_repos = unique(dup_repos$project) num_dup_repos = length(dup_repos) check(num_dup_repos == 33) # determine how many commits are there in the affected projects in total commits_by_dup_repos = data %>% filter(project %in% dup_repos) %>% group_by(sha) num_commits_by_dup_repos = nrow(commits_by_dup_repos) # since we can't do better, exclude all duplicate commits from the dataset data = data %>% filter(! sha %in% duplicates_sha$sha); # some reporting for the paper out("numberOfProjectsWithDuplicates", num_dup_repos) out("numDuplicateCommits", num_duplicates_sha) out("percentageDuplicateCommits", round(num_duplicates_sha/initial_number_of_commits * 100,2)) out("percentageDuplicateRowsLost", round(100 - nrow(data) / initial_number_of_rows * 100, 2)) hist(duplicates_sha$n, breaks = 100) }

By removing the duplicates we may have created project-language pairs that have fewer than 20 commits. The original study excluded such pairs and we see no reason not to repeat this. Let's find any such rows and delete them

if (REMOVE_DUPLICATES) { num_rows_before_cleanup = nrow(data) not_keep = data %>% group_by(project, language) %>% dplyr::summarize(n = n()) %>% filter(n < 20) keep = data %>% group_by(project, language) %>% dplyr::summarize(n = n()) %>% filter(n >= 20) data = keep %>% inner_join(data, by=c("project", "language")) out("smallProjectCommits", nrow(not_keep)) out("percentageDuplicationReadjustmentRowsLost", round(100 - nrow(data)/num_rows_before_cleanup * 100, 2)) }

#### 3.2.2. Removing Typescript

The Typescript language was released in October 1, 2012 and is the smallest language in the study. However, its first commit in the study is dated 2003. What is happening here is that that the `.ts`

extension is used for translation files, not Typescript, which was not detected by the original study.

if (REMOVE_TYPESCRIPT) { ts = data %>% filter(language == "Typescript") ts_rows_num = nrow(ts) ts_proj_num = length(unique(ts$project)) check(ts_proj_num == 53) out("initialNumTSProjects", ts_proj_num) ts_commits_num <- length(unique(ts$sha)) check(ts_commits_num == 10105) out("initialNumTSCommits", ts_commits_num) ts_dates <- sort(unique(ts$commit_date)) # "2003-03-01" check(ts_dates[1] == "2003-03-21") out("tsFirstCommit", ts_dates[1]) }

We now remove from TS what we manually confirmed to be translations and not actual TypeScript:

if (REMOVE_TYPESCRIPT) { translation_projects <- c("mythtv", "hw", "SpeedCrunch", "qBittorrent", "mumble", "pokerth", "hydrogen", "unetbootin", "tiled", "goldendict", "zf2", "Cockatrice", "tagainijisho", "focuswriter", "LibreCAD", "razor-qt", "qupzilla", "tomahawk", "ppcoin", "mirall", "MuseScore", "shotcut") ts = ts %>% filter(! project %in% translation_projects) check(length(unique(ts$sha)) == 4456) ts = ts %>% filter(!str_detect(project, "coin")) %>% # remove all the *coin projects filter(!str_detect(project, "Coin")) %>% # remove all the *Coin projects filter(!str_detect(project, "change")) # remove all the *change projects # TODO why are these in remaining??? remaining_translations <- c("antimicro", "identifi", "octopi", "pcbsd", "subsurface", "ttrss", "wps_i18n") ts = ts %>% filter(!project %in% remaining_translations) real_ts_num_proj <- length(unique(ts$project)) check(real_ts_num_proj == 16) out("realTSProjNum", real_ts_num_proj) real_ts_num_commit <- length(unique(ts$sha)) check(real_ts_num_commit == 3782) out("realTSCommitsNum", real_ts_num_commit) real_ts_proj <- unique(ts$project) }

Another issue with Typescript is that a lot of its files are only type definitions, and contain no actual code in Typescript but headers for Javascript functions and classes elsewhere. We should exclude these as well:

if (REMOVE_TYPESCRIPT) { # typescript-node-definitions, DefinitelyTyped, tsd # DEPRECATED: TSD is deprecated, please use Typings and see this issue for more information. tdefs1 = ts %>% filter(project=="typescript-node-definitions") tdefs2 = ts %>% filter(project=="DefinitelyTyped") tdefs3 = ts %>% filter(project=="tsd") sts <- length(unique(ts$sha)) st1 <- length(unique(tdefs1$sha)) st2 <- length(unique(tdefs2$sha)) st3 <- length(unique(tdefs3$sha)) ratio <- round((st1 + st2 + st3)/sts*100, 1) check(ratio == 34.6) out("ratioOfTypeDefTSCommits", ratio) ts = ts %>% filter(project !="typescript-node-definitions") %>% filter(project !="DefinitelyTyped") %>% filter(project !="tsd") }

So what are we left with?

if (REMOVE_TYPESCRIPT) { ts_valid_rows_num = nrow(ts) ts_valid_commits = length(unique(ts$sha)) ts_valid_projects = length(unique(ts$project)) out("tsValidRows", ts_valid_rows_num) out("tsValidCommits", ts_valid_commits) out("tsValidProjects", ts_valid_projects) }

We are down to only 13 projects out of 50 and given the fact Typescript was the smallest language to begin with and how much of its commits we had to remove with minimal effort, the only safe thing we can do is to exclude Typescript from the analysis completely:

if (REMOVE_TYPESCRIPT) { num_rows_before_ts = nrow(data) data = data %>% filter(language != "Typescript") out("percentageTypescriptRowsLost", round(100 - nrow(data) / num_rows_before_ts * 100, 2)) data$language = factor(as.character(data$language)) }

### 3.3. Removing V8

The V8 project is plagued with errors and inaccuracies:

if (REMOVE_V8) { v8 = everything %>% filter(project == "v8") out("vCCommits", length(unique(v8[v8$tag == "c",]$sha))) out("vCppCommits", length(unique(v8[v8$tag == "cpp",]$sha))) out("vJavascriptCommits", length(unique(v8[v8$tag == "javascript",]$sha))) out("vPythonCommits", length(unique(v8[v8$tag == "python",]$sha))) }

The paper ignores all of V8's C++ files (the `.cc`

and `.h`

file extensions are ignored) and classifies V8 as a JavaScript project. This is obviously wrong and since V8 is one of the larger projects, may skew the analysis substantially. To avoid this, we remove V8 from the data:

if (REMOVE_V8) { num_rows_before_v8 = nrow(data) data = data %>% filter(project != "v8") out("percentageVEightRowsLost", round(100 - nrow(data) / num_rows_before_v8 * 100, 2)) }

#### 3.3.1. Summary

Let's summarize the final numbers after all cleaning:

newShaNum <- length(unique(data$sha)) ratioSha <- round(100 - (newShaNum/initial_number_of_commits*100),2) out("finalNumSha", newShaNum) out("finalNumShaMio", round(newShaNum/1000/1000,1)) out("ratioReducedSha", ratioSha) out("ratioReducedShaRows", round(100 - nrow(data) / nrow(initial_data) * 100, 2)) f_number_of_projects <- length(unique(data$project)) check(f_number_of_projects == 719) out("finalNumberOfProjectsIncluded", f_number_of_projects) f_sloc <- sum(data$insertion) - sum(data$deletion) f_sloc_mio <- round(f_sloc / 1000000, 1) check(f_sloc_mio == 58.2) out("finalSlocMio", f_sloc_mio) f_number_authors <- length(unique(data$author)) check(f_number_authors == 46204) out("finalNumberAuthors", round(f_number_authors/1000,0)) f_bugFixes = data %>% filter(isbug == 1) f_numberOfBugFixes <- length(unique(f_bugFixes$sha)) out("finalNumberOfBugFixes", f_numberOfBugFixes)

Let's see how much data we shed in different categories:

cat(paste("Commits (unique): ", round(100 - length(unique(data$sha)) / length(unique(initial_data$sha)) * 100, 2), "%\n")) cat(paste("Commits (rows): ", round(100 - nrow(data) / nrow(initial_data) * 100, 2), "%\n")) cat(paste("Buggy commits (rows): ", round(100 - sum(data$isbug) / sum(initial_data$isbug) * 100, 2), "%\n")) cat(paste("Projects: ", round(100 - length(unique(data$project)) / length(unique(initial_data$project)) * 100, 2), "%\n")) cat(paste("SLOC: ", round(100 - sum(data$insertion - data$deletion) / sum(initial_data$insertion - initial_data$deletion) * 100, 2), "%\n")) cat(paste("Languages: ", round(100 - length(unique(data$language)) / length(unique(initial_data$language)) * 100, 2), "%\n")) cat(paste("Authors: ", round(100 - length(unique(data$author)) / length(unique(initial_data$author)) * 100, 2), "%\n"))

### 3.4. Summarization

Summarize the data for the graphs and modelling as we did in the repetition:

data$combined = data$combinedOriginal # does not matter if (USE_AUTHORS_INSTEAD_COMMITTERS) { data$devs = data$author } else { data$devs = data$committer } X = summarizeByLanguage(data) Y = logTransform(X, log, log)

### 3.5. Graphs

Let's do some graphs:

#### 3.5.1. Commits vs. Bugfixes.

data %>% group_by(language) %>% summarize(commits = n(), bugfixes = sum(isbug)) -> df ggplot(data=df, aes(x=commits, y=bugfixes)) + geom_smooth(method='lm',size=.5, formula = y ~ x) + geom_point() + geom_text_repel(aes(label=language), segment.color = 'grey50', fontface = 'bold', size = 5) + theme_light() + theme(text = element_text(size = 15)) + labs(y="Bug-fixing commits", x = "Commits") + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) + scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) -> p ggsave(paste(WORKING_DIR, "/Figures/commits.pdf", sep = "")) print(p)

#### 3.5.2. Bug rate vs project age for selected languages

X %>% dplyr::select(project, language, commits, bcommits,max_commit_age) %>% mutate(br=round(bcommits/commits*100,1)) -> df most_commits <- .9 * sum(X$commits) df %>% arrange(desc(commits)) -> df # find the number of project that would comprise over 90% of the entire dataset accumulatedCommits = 0 i = 0 while (accumulatedCommits <= most_commits) { i = i + 1 accumulatedCommits = accumulatedCommits + df[i, 3] } # verify that we have the correct nunber check(sum(df[1:i,3]) > most_commits) check(sum(df[1:i-1,3]) <= most_commits) df[1:i,] ->most_df most_df %>% filter(language %in% c("C", "C++","Clojure","Haskell","Objective-C","Scala")) -> most_df2 br_mean <- most_df2 %>% group_by(language) %>% summarise(br=mean(br)) age_mean <- most_df2 %>% group_by(language) %>% summarise(age=mean(max_commit_age)) ggplot(data=most_df2, aes(x=max_commit_age,y=br,color=language)) + geom_point() + geom_hline(aes(yintercept=br, color='black'), size=.25, br_mean) + geom_vline(aes(xintercept=age), size=.25, age_mean) + theme_light() + theme(text = element_text(size=20)) + scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) + labs(x="Project age (days)", y="Bug rate (percent of commits)") + theme(legend.position = "none") + facet_wrap(~language, ncol=2) -> p X %>% mutate(br=round(bcommits/commits*100,1)) %>% group_by(language) %>% summarize(mbr = mean(br), mage = mean(max_commit_age)) %>% arrange(desc(mbr)) -> mbr ggsave(paste(WORKING_DIR, "/Figures/age.pdf", sep = ""), width=8, height=8) print(p)

#### 3.5.3. Developer behavior

data %>% group_by(author) %>% summarize(n=n()) %>% arrange(desc(n)) -> devs most_commits_bydev <- .9*sum(devs$n) # find the number of developers accounting for <= 90% of the entire dataset accumulatedCommits = 0 d = 0 while (accumulatedCommits <= most_commits_bydev) { d = d + 1 accumulatedCommits = accumulatedCommits + devs[d, 2] } # verify that we have the correct nunber check(sum(devs[1:d-1,2]) < most_commits_bydev) check(sum(devs[1:d,2]) > most_commits_bydev) devs_prolific <- devs$author[1:d] # Show the number of languages the most prolific authors committed for data %>% filter(author %in% devs_prolific) -> devs_prolific devs_prolific = devs_prolific %>% group_by(author,language) %>% summarize(n=n()) %>% summarize(l=n()) %>% arrange(desc(l)) devs_prolific hist(devs_prolific$l)

### 3.6. Modelling

#### 3.6.1. Basic analyses & graphs of the dataset

dimnames(X)[[2]] pairs(X[,3:7], gap=0, pch='.')

Let's log transform the data:

Y = logTransform(X) pairs(Y[,2:6], gap=0, pch='.')

#### 3.6.2. Language Frequencies

Explore the relative frequencies of the languages--how many projects use them in the dataset

sort(table(Y$language))

The dataset contains 17 languages (16, if we remove TypeScript). They were represented by unevenly distributed numbers of projects (from 23 for Perl to 199 for Javascript).

Explore the relationship between the languages, and other measurements

# (devs, tins, max_commit_age, commits, bcommits) boxplot(split(Y$lbcommits, Y$language), las=2, main="Log(bugged commits, 0 replaced with 0.5)") par(mfrow=c(1,2), oma=c(1,1,1,1), mar=c(5,1,2,1)) boxplot(split(Y$ldevs, Y$language), las=2, main="Log(devs)") boxplot(split(Y$ltins, Y$language), las=2, main="Log(tins)") boxplot(split(Y$lmax_commit_age, Y$language), las=2, main="Log(max_commit_age)") boxplot(split(Y$lcommits, Y$language), las=2, main="Log(commits)")

The distribution of bug-commits varied between the languages, but so did the distributions of the other measurements. It is difficult to see direct associations between languages and bugs from bivariate plots.

#### 3.6.3. Fitting the Negative Binomial

Let's now fit the negative binomial and compare to the original paper - this is the same code as in replication:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = contr.Weights(Y$language)), data=Y) nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = contr.Weights(Y$language_r)), data=Y) # combine them into single result table resultWeighted = combineModels(nbfit, nbfit_r, Y$language) juxtWeighted = merge(resultWeighted, baselineFSE_RQ1(), by = 0, all = T, sort = F) juxtWeighted$ok = checkPValues(juxtWeighted, "FSE_pv", "pVal") juxtWeighted

And we see that just cleaning the data did invalidate several of the language claims made by the original paper.

#### 3.6.4. Fitting the zero-Sum Contrasts Negative Binomial

The zero-sum contrasts are preferred to weighted contrasts for their better stability. Let's look at how they look:

contr.sum(length(levels(Y$language)))

And let's fit the model:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = contr.sum), data=Y) nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = contr.sum), data=Y) # combine them into single result table resultZeroSum = combineModels(nbfit, nbfit_r, Y$language) juxtZeroSum = merge(resultZeroSum, baselineFSE_RQ1(), by = 0, all = T, sort = F) juxtZeroSum$ok = checkPValues(juxtZeroSum, "FSE_pv", "pVal") juxtZeroSum

Some invalidatuions still.

#### 3.6.5. Fit Negative Binomial regression without languages and compare the full and the reduced models

nbfit_reduced <- glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits, data=Y) summary(nbfit_reduced)

Comparing two nested models, i.e.

anova(nbfit_reduced, nbfit)

cat("AIC, full:", AIC(nbfit), "\n") cat("AIC, reduced:", AIC(nbfit_reduced), "\n") cat("BIC, full:", BIC(nbfit), "\n") cat("BIC, reduced:", BIC(nbfit_reduced), "\n")

The difference between the models is borderline.

#### 3.6.6. Adjusting for Multiple Hypothesis

The original paper just compares the p-Values against thresholds, which is not correct way to do. In the presence of multiple hypothesis, the p-Values must be adjusted. There are various ways to adjust the p-Values, namely the Bonferroni and FDR (Benjamini & Hochberg). The FDR adjustment is more permissive and the Bonferrioni is the more conservative one. What we can do now is to revisit the juxtaposed tables and add extra columns for the FDR and Bonferroni adjustments:

# the the pValues for the predictions, not for the control variables since these are ignored by the adjustments pValWeighted = juxtWeighted$pVal[6:(5 + length(unique(Y$language)))] # create the empty vectors with NAs instead of pValues fdrWeighted = rep(NA, nrow(juxtWeighted)) bonfWeighted = rep(NA, nrow(juxtWeighted)) # update the relevant parts of the vectors with the adjusted pValues fdrWeighted[6:(5+ length(pValWeighted))] = round(p.adjust(pValWeighted, "fdr"), 3) bonfWeighted[6:(5+ length(pValWeighted))] = round(p.adjust(pValWeighted, "bonferroni"), 3) # add the columns to the juxtaposed tables juxtWeighted$pVal_fdr = fdrWeighted juxtWeighted$pVal_bonf = bonfWeighted # now remove the old ok column and add the ok, ok_fdr and ok_bonf columns juxtWeighted = juxtWeighted %>% dplyr::select(-(ok)) # add the columns juxtWeighted$ok = checkPValues(juxtWeighted, "FSE_pv", "pVal") juxtWeighted$ok_fdr = checkPValuesLevel(juxtWeighted, "FSE_pv", "pVal_fdr", 0.01) juxtWeighted$ok_bonf = checkPValuesLevel(juxtWeighted, "FSE_pv", "pVal_bonf", 0.01) juxtWeighted

With the exception of the last language (Coffeescript), the FDR and Bonferroni are the same and they actually invalidate some of the original predictions. Let's look at the zero-sum contrasts version now:

# the the pValues for the predictions, not for the control variables since these are ignored by the adjustments pValZeroSum = juxtZeroSum$pVal[6:(5 + length(unique(Y$language)))] # create the empty vectors with NAs instead of pValues fdrZeroSum = rep(NA, nrow(juxtZeroSum)) bonfZeroSum = rep(NA, nrow(juxtZeroSum)) # update the relevant parts of the vectors with the adjusted pValues fdrZeroSum[6:(5+ length(pValZeroSum))] = round(p.adjust(pValZeroSum, "fdr"), 3) bonfZeroSum[6:(5+ length(pValZeroSum))] = round(p.adjust(pValZeroSum, "bonferroni"), 3) # add the columns to the juxtaposed tables juxtZeroSum$pVal_fdr = fdrZeroSum juxtZeroSum$pVal_bonf = bonfZeroSum # now remove the old ok column and add the ok, ok_fdr and ok_bonf columns juxtZeroSum = juxtZeroSum %>% dplyr::select(-(ok)) # add the columns juxtZeroSum$ok = checkPValues(juxtZeroSum, "FSE_pv", "pVal") juxtZeroSum$ok_fdr = checkPValuesLevel(juxtZeroSum, "FSE_pv", "pVal_fdr", 0.01) juxtZeroSum$ok_bonf = checkPValuesLevel(juxtZeroSum, "FSE_pv", "pVal_bonf", 0.01) juxtZeroSum

The situation is fairly similar here.

#### 3.6.7. Statistical significance vs practical significance, for languages

Since the number of observations is large, statistical significance can be driven by the sample size without being meaningful in practice.

Here we contrast model-based confidence intervals and prediction intervals, on the log and on the original scale

numLanguages = length(unique(Y$language)) # Create a new data structure for prediction newY <- NULL for (i in 1:numLanguages) { newY <- rbind(newY, data.frame(language=rep(levels(Y$language)[i], 100), ldevs=rep(median(Y$ldevs), 100), lcommits=seq(from=min(Y$lcommits), to=max(Y$lcommits), length=100), ltins=rep(median(Y$ltins), 100), lmax_commit_age=rep(median(Y$lmax_commit_age), 100))) } newY$commits <- exp(newY$lcommits) # Make predictions pr_nbfit <- predict(nbfit, type="response", newdata=newY, se.fit=TRUE) newY$pr_mean <- pr_nbfit$fit newY$pr_se <- pr_nbfit$se.fit

Consider languages with the most predicted bugs (C++) and fewest predicted bugs (Clojure). Compute the log CI for C++ and Clojure and the log Prediction CI. Then translate the intervals on the original scale.

axfont = 32 # size of axis title font ticfont = 26 # size of axes' font legfont = 28 # size of legend title legtext = 26 # size of legend text ptitle = 30 # size of plot title letters getConfInterval<-function(df,lang) { df %>% filter(language==lang) %>% mutate(language = lang, x = lcommits, y = log(pr_mean), yhigh = log(pr_mean + qnorm(1-0.01/numLanguages) * pr_se), ylow = log(pr_mean - qnorm(1-0.01/numLanguages) * pr_se)) %>% dplyr::select(language, x, y, ylow, yhigh) } dfCI <- rbind(getConfInterval(newY, "C++"), getConfInterval(newY, "Clojure")) getPredInterval<-function(df,lang) { df %>% filter(language==lang) %>% mutate(language = lang, x = lcommits, y = log(pr_mean), yhigh = log(qnbinom(1-0.01/numLanguages, mu= pr_mean, size= nbfit$theta) ), ylow = log(qnbinom(0.01/numLanguages, mu=pr_mean, size=nbfit$theta))) %>% dplyr::select(language, x, y, ylow, yhigh) } dfPI <- rbind(getPredInterval(newY, "C++"), getPredInterval(newY, "Clojure")) plotIt <- function(df) { ggplot(data = df, aes(x=x,y=y,color=language)) + geom_line() + geom_ribbon(aes(ymin=df$ylow, ymax=df$yhigh), linetype=2, alpha=0.1) + labs(x="log of commits", y="log of bug-fixing commits") + theme_light() + theme(axis.title = element_text(size=axfont), axis.text = element_text(size=ticfont), plot.title = element_text(hjust = 0.5, size = ptitle)) } plotIt(dfCI) + theme(legend.position = "none") + ggtitle("(a)") -> p1 plotIt(dfPI) + theme(legend.title = element_text(size=legfont), legend.text = element_text(size=legtext)) + guides(colour = guide_legend(nrow = 1)) + ggtitle("(b)") -> p2 # Grab the legend to display it separately tmp <- ggplot_gtable(ggplot_build(p2)) l <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") legend <- tmp$grobs[[l]] legend p2 + theme(legend.position = "none") -> p2 getConfInterval <- function(df,lang) { df %>% filter(language==lang) %>% mutate(language = lang, x = commits, y = pr_mean, yhigh = pr_mean + qnorm(1-0.01/numLanguages) * pr_se, ylow = pr_mean - qnorm(1-0.01/numLanguages)* pr_se ) %>% dplyr::select(language, x, y, ylow, yhigh) } dfCI <- rbind(getConfInterval(newY, "C++"), getConfInterval(newY, "Clojure")) getPredInterval<-function(df,lang) { df %>% filter(language==lang) %>% mutate(language = lang, x = commits, y = pr_mean, yhigh = qnbinom(1-0.01/numLanguages, mu= pr_mean, size= nbfit$theta) , ylow =qnbinom(0.01/numLanguages, mu=pr_mean, size=nbfit$theta)) %>% dplyr::select(language, x,y,ylow,yhigh) } dfPI <- rbind(getPredInterval(newY, "C++"), getPredInterval(newY, "Clojure")) plotIt <- function(df) { ggplot(data = df, aes(x=x,y=y,color=language)) + geom_line() + geom_ribbon(aes(ymin=df$ylow, ymax=df$yhigh), linetype=2, alpha=0.1) + theme_light() + theme(axis.title = element_text(size=axfont), axis.text = element_text(size=ticfont), plot.title = element_text(hjust = 0.5, size = ptitle), legend.position = "none") + labs(x="commits", y="bug-fixing commits") } plotIt(dfCI) + xlim(0,800) + ylim(0,400) + ggtitle("(c)") -> p3 plotIt(dfPI) + xlim(0,800) + ylim(0,620) + ggtitle("(d)") -> p4

pdf(paste(WORKING_DIR, "/Figures/intervals.pdf", sep = ""), width=32, height=8) figs <- grid.arrange(arrangeGrob(p1, p2, p3, p4, nrow=1), arrangeGrob(legend), nrow=2, heights=c(4,1)) dev.off()

# Plot predicted means for all the languages par(mfrow=c(1,2)) # Log scale plot(log(pr_mean)~lcommits, main="log(Exp. values), all languages", data=newY[newY$language==levels(Y$language)[1],], type="l", ylab="log(bugged commits)") for (i in 2:numLanguages) { lines(log(pr_mean)~lcommits, data=newY[newY$language==levels(Y$language)[i],]) } # Original scale plot(pr_mean~commits, main="Exp. values, all languages", data=newY[newY$language==levels(Y$language)[1],], type="l", xlim=c(0, 800), ylim=c(0,400), ylab="bugged commits") for (i in 2:numLanguages) { lines(pr_mean~commits, data=newY[newY$language==levels(Y$language)[i],]) }

#### 3.6.8. Add uncertainty in labels

if (UNCERTAINTY) { fp <- 0.36 fn <- 0.11 # Function to get parameter values getParams <- function(Ystar) { # Fit NB with standard contrasts nbfit <- glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = "contr.sum"), data=Ystar) s <- summary(nbfit)$coefficients # Fit the releveled model with standard contrasts, # to get the parameter for Scala nbfit_r <- glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = "contr.sum"), data=Ystar) s_r <- summary(nbfit_r)$coefficients # Return params, incluing Scala out <- c(s[,1], s_r[6, 1]) names(out) <- c(dimnames(s)[[1]][1:5], levels(Ystar$language)) out } numBootstrapIterations = 10000 # Perform sampling set.seed(1) paramNum <- numLanguages + 5 paramsLang <- matrix(rep(NA,paramNum*numBootstrapIterations), nrow=paramNum) for (i in 1:numBootstrapIterations) { # Sample rows Ystar <- Y[sample(1:nrow(Y), replace = T),] # Adjust for tp and fp: # Reduce bcommits by prob fp # Increase non-bugged commits by prob fn tmp <- rbinom(n=nrow(Ystar), size=Ystar$bcommits, prob=1-fp) + rbinom(n=nrow(Ystar), size=(Ystar$commits-Ystar$bcommits), prob=fn) Ystar$bcommits <- tmp # Get parameters paramsLang[,i] <- getParams(Ystar) } }

To determine whether the bootstrapped values are statistically signifficant, we analyze whether the x and 1-x quantiles both have the same sign. If they do, the result is signifficant in that regard, if the do not, then the results is not statistically signifficant. Similarly to p-values, we can compare using different quantiles. The proper, conservative quantile is 0.01 divided by number of languages tested, the less conservative option is just 0.01.

if (UNCERTAINTY) { paramsRowNames = getModelRowNames(nbfit, Y$language) result = data.frame( row.names = paramsRowNames, coef = rep(NA, length(paramsRowNames)), se = rep(NA, length(paramsRowNames)), sig = rep(NA, length(paramsRowNames)), sigCons = rep(NA, length(paramsRowNames)) ) par(mfrow=c(2,4)) quant = 0.01 quantCons = 0.01 / numLanguages for ( i in 1:length(paramsRowNames)) { result$coef[[i]] = round(mean(paramsLang[i,]), digits = 2) result$se[[i]] = round(sd(paramsLang[i,]), digits = 2) qsigns = sign(quantile(paramsLang[i,], probs = c(quant, 1-quant, quantCons, 1-quantCons), na.rm = T)) result$sig[[i]] = qsigns[[1]] == qsigns[[2]] result$sigCons[[i]] = qsigns[[3]] == qsigns[[4]] hist(paramsLang[i,], xlab=paramsRowNames[i], main=paste("mean =", round(mean(paramsLang[i,]), digits=2)) ) abline(v=0, col="red", lwd=2) } }

if (UNCERTAINTY) { result }

Finally, let's juxtapose this to the baseline as usual:

if (UNCERTAINTY) { juxtBootstrap = merge(result, baselineFSE_RQ1(), by = 0, all = T, sort = F) juxtBootstrap$ok = checkSignificance(juxtBootstrap, "FSE_pv", "sigCons") juxtBootstrap }

### 3.7. More graphs: Bug rate over time

filter_proj <- function(df,proj,lang) { data %>% ungroup() %>% filter(project == proj, language == lang) %>% dplyr::select(commit_age, isbug) %>% arrange(commit_age) -> bt bt %>% group_by(commit_age,isbug) %>% summarize(n = n()) %>% spread(key = "isbug",value = "n") -> bt2 bt2 %>% mutate(br = round(`0`/(`0`+`1`),2), month = as.integer(commit_age/30)) -> bt3 bt3 %>% na.omit() %>% group_by(month) %>% summarize(n = n(), brs = sum(br), brm = brs/n) %>% mutate(name= paste0(proj," ",lang)) -> prj rbind(df, prj) } filter_proj(NULL, "linux","C") %>% filter_proj("mono","C#") %>% filter_proj("homebrew","Ruby") %>% filter_proj("WordPress","Php") %>% filter_proj("salt","Python") %>% filter_proj("mythtv","C++") %>% filter_proj("jenkins","Java") %>% filter_proj("akka","Scala") %>% filter_proj("rabbitmq-server","Erlang") %>% filter_proj("brackets","Javascript") %>% filter_proj("yi","Haskell") %>% filter_proj("overtone","Clojure") ->prj prj %>% ggplot( aes(x = month, y = brm)) + geom_point(size=.4) + theme_light() + geom_smooth(method='lm',size=.5, formula = y ~ x) + facet_wrap( ~ name,ncol=3) + labs(x="",y="") + xlab("Project lifetime (months)") + ylab("Percent bug-labeled commits") + theme(strip.text.x = element_text(size = 14), text = element_text(size=13)) + theme(text = element_text(size=20)) ggsave(paste(WORKING_DIR, "/Figures/bugspermonth.pdf", sep = ""), width = 8, height = 8, dpi = 100)

### 3.8. Commits for top-5 projects by language

data %>% group_by(language, project) %>% summarize(n = n()) %>% arrange(desc(n)) %>% arrange(desc(language)) %>% top_n(5, wt = n) -> projsize_ordered2 projsize_ordered2 projsize_ordered2 %>% ggplot(aes(x = reorder(factor(project), -n), y = n)) + geom_bar(stat="identity") + facet_grid(. ~ language, scales = "free") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(x = "Project (by language)", y = "Number of Commits") + scale_y_continuous(labels = scales::comma)

### 3.9. Authors

everything %>% group_by(author, sha) %>% summarize(n=n()) %>% group_by(author) %>% summarise(commits = n()) -> auth auth %>% arrange(desc(commits)) tot <- sum(auth$commits) check( tot == 1485049) top <- sort(auth$commits, dec=T)[1:453] sum(top)/tot ## 46 % everything %>% group_by(author, project) %>% summarize(n=n()) %>% group_by(author) %>% summarise(proj = n()) %>% arrange(desc(proj)) -> authproj summary(authproj) ## mean 1.2

### 3.10. Writing the results

Finally, let's write the results of our analyses into CSV files so that they can be picked and analyzed later:

weighted = data.frame( coef = juxtWeighted$coef, se = juxtWeighted$se, pVal = round(juxtWeighted$pVal,3), pVal_fdr = juxtWeighted$pVal_fdr, pVal_bonf = juxtWeighted$pVal_bonf, row.names = juxtWeighted$Row.names ) write.csv(weighted, paste0(WORKING_DIR, "/Data/languages_weighed.csv")) zeroSum = data.frame( coef = juxtZeroSum$coef, se = juxtZeroSum$se, pVal = round(juxtZeroSum$pVal,3), pVal_fdr = juxtZeroSum$pVal_fdr, pVal_bonf = juxtZeroSum$pVal_bonf, row.names = juxtZeroSum$Row.names ) write.csv(zeroSum, paste0(WORKING_DIR, "/Data/languages_zeroSum.csv")) # only store bootstrap if we have actually created it if (UNCERTAINTY) { bootstrap = data.frame( coef = juxtBootstrap$coef, se = juxtBootstrap$se, sig = juxtBootstrap$sig, sigCons = juxtBootstrap$sigCons, row.names = juxtBootstrap$Row.names ) write.csv(bootstrap, paste0(WORKING_DIR, "/Data/languages_bootstrap.csv")) }

remove(WORKING_DIR)

## 4. Analysis of the permutations of re-anaysis flags (permutations.Rmd)

`Rscript --vanilla generate_permutations.R`

When the permutaions have been geneated, we must analyze them. Clear the flags and set the working directory to permutations root:

source("implementation.R") flags = c( REMOVE_DUPLICATES = 0, REMOVE_TYPESCRIPT = 0, REMOVE_V8 = 0, USE_AUTHORS_INSTEAD_COMMITTERS = 0, UNCERTAINTY = 0 ) WORKING_DIR = "./artifact/permutations"

First, get the permutation results calculated previously:

`est`

= value of the estimate`se`

= standard error`pv`

=`NA`

if the result was not signifficant, p-value limit otherwise)

For bootstrap, we get `est`

, `se`

and `sig`

and `sigCons`

which are boolean vectors telling us whether the results were significant.

loadPermutation = function(b, name, permutation) { lWeighted = read.csv(paste0(permutation, "/Data/languages_weighed.csv")) rownames(lWeighted) = lWeighted$X lWeighted = lWeighted %>% dplyr::select(-X) lZeroSum = read.csv(paste0(permutation, "/Data/languages_zeroSum.csv")) rownames(lZeroSum) = lZeroSum$X lZeroSum = lZeroSum %>% dplyr::select(-X) bootstrapFilename = paste0(permutation, "/Data/languages_bootstrap.csv") if (file.exists(bootstrapFilename)) { lBootstrap = read.csv(bootstrapFilename) rownames(lBootstrap) = lBootstrap$X lBootstrap = lBootstrap %>% dplyr::select(-X) } else { lBootstrap = NULL } # and create the result list list( name = name, weighted = lWeighted, zeroSum = lZeroSum, bootstrap = lBootstrap ) }

Let's now look at all permutations and loat them:

loadAllPermutations = function(baseDir, b = baseline) { dirs = list.dirs(baseDir, recursive = F) result = list() for (d in dirs) { pName = strsplit(d, "/") pName = pName[[1]][[length(pName[[1]])]] if (nchar(pName) != length(flags)) next() tryCatch({ result[[pName]] = loadPermutation(b, pName, d) }, error = function(e) { print(e) cat("Permutation ", pName, "FAIL\n") }) } cat("Total valid permutations: ", length(result)) result } permutations = loadAllPermutations(WORKING_DIR)

Let's now create the skeleton for teh RQ1 table in the paper:

result = baselineFSE_RQ1() result = mergeDataFrames(result, baselineCACM_RQ1()) result = result %>% dplyr::select(FSE_coef, FSE_pv, CACM_coef, CACM_pv) repetition = permutations[["00000"]]$weighted %>% dplyr::select(repetition_coef = coef, repetition_pv = pVal) result = mergeDataFrames(result, repetition) cleaned = permutations[["11111"]]$weighted %>% dplyr::select(clean_coef = coef, clean_pv = pVal, adjusted_fdr = pVal_fdr, adjusted_bonf = pVal_bonf) result = mergeDataFrames(result, cleaned) zeroSum = permutations[["11111"]]$zeroSum %>% dplyr::select(zeroSum_coef = coef, zeroSum_pv = pVal_bonf) result = mergeDataFrames(result, zeroSum) bootstrap = permutations[["11111"]]$bootstrap %>% dplyr::select(bootstrap_coef = coef, bootstrap_sig = sigCons) result = mergeDataFrames(result, bootstrap) result

And output it to the rq1 table:

output_RQ1_table_repetition(result) output_RQ1_table_reanalysis(result)

remove(WORKING_DIR)

## 5. Missing Commits (missing_commits.Rmd)

This separate notebook deals with our analysis of commits not accounted for in the artifact obtained from the original authors.

# load the file containing the actual implementation details knitr::opts_chunk$set(echo = FALSE) source("implementation.R") initializeEnvironment("./artifact/missing-commits")

Load the data processed by the repetition and summarize them.

data = read.csv("./artifact/repetition/Data/newSha.csv") data$combined = data$combinedOriginal # does not matter data$devs = data$committer # does not matter data = summarizeByLanguage(data)

First, we download projects. This gets us the projects' metadata, all commits, and all unique files changed in the commits. Out of 728 projects, we downloaded 618 (causes: network failures during download, projects going private, or projects being deleted) and analyzed 513 (node.js, which we used to download the projects, segfaulted on several of them). The commits reported for the study were then analyzed; and for each project, we remember the list of commits used in the study.

```
Total records: 1578165
Total projects: 729
Multi-commits: 46526
Unique commits: 1531639
```

Multi-commits are commits that have multiple languages.

The downloaded projects were matched. Since the study has project names without repository owners, matches could be ambiguous. We end up with 423 matched projects. One item, dogecoin, has the same name but two different projects. For each project, we looked at all commits and classified them as:

valid (i.e. present in the study and in the project)

irrelevant (i.e. present in the project, but not relevant to the study since they do not change any file in the studied languages)

missing (present in the project, but not in the study, while changing at least one file in studied language)

invalid (present in the study, not present in the project)

This data has been obtained by running the `commits-verifier`

tool.

<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->

<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->

### 5.1. Results on missing commits

<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->

<!--%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%-->

First some verification and data cleanup -- if we matched a project wrongly, then we would see all invalid commits.

mc = read.csv("./input_data/missing-commits.csv") mc %>% filter(invalid > 0) %>% arrange(desc(invalid))

There are four projects with invalid counts larger than 1. In the case of "framework" and "hw", the numbers are large enough to be worth setting them aside. The case of "generator-mobile", where we have only invalid commits, suggests a badly matched project. We ignore it. Lastly, "DefenitelyTyped" has 8 mistmatched commits -- but since it is one of the TypeScript projects that contains no code, we can safely ignore it.

mc %>% filter(invalid <= 1) -> missing_commits valid_sum <- sum(missing_commits$valid) check(valid_sum == 426845) missing_sum <- sum(missing_commits$missing) check(missing_sum == 106411) ratio_missing <- round(missing_sum/(missing_sum+valid_sum)*100,2) check(ratio_missing == 19.95) out("MissingCommitsThousands", round(missing_sum/1000,0)) out("MissingCommitsRatio", ratio_missing)

In total, we have seen 426k commits in the projects we have cross-checked. There were 106k missing commits (19.95%).

The number of commits per project is skewed towards very few valid commits

Invalid commits are in almost every project, and there are projects that are almost entirely missing

Projects with the highest ratio of missing commits:

missing_commits %>% mutate(ratio = round(missing/(missing+valid)*100,2)) %>% arrange(desc(ratio))

V8 is high on the list -- the 12th most incomplete project (around 70% of commits are missing).

`ggsave()`

disabled due to a strange font error —MPD

data %>% group_by(language) %>% summarize(commits = sum(commits)) -> commits_by_lang commits_by_lang[commits_by_lang$language == "C", 2] <- commits_by_lang[commits_by_lang$language == "C", 2] + commits_by_lang[commits_by_lang$language == "C++", 2] commits_by_lang %>% filter(language != "C++") -> commits_by_lang commits_by_lang %>% mutate(missing = 0) -> c c$language <- as.character(c$language) c[c$language == "C", 3] <- sum(missing_commits$cpp) #C++ and C together c[1, 1] <- "C/C++" c[c$language == "C#",3] <- sum(missing_commits$cs) c[c$language == "Objective-C",3] <- sum(missing_commits$objc) c[c$language == "Go",3] <- sum(missing_commits$go) c[c$language == "Coffeescript",3] <- sum(missing_commits$coffee) c[c$language == "Javascript",3] <- sum(missing_commits$js) c[c$language == "Ruby",3] <- sum(missing_commits$ruby) c[c$language == "Typescript",3] <- sum(missing_commits$ts) c[c$language == "Php",3] <- sum(missing_commits$php) c[c$language == "Python",3] <- sum(missing_commits$python) c[c$language == "Perl",3] <- sum(missing_commits$perl) c[c$language == "Clojure",3] <- sum(missing_commits$clojure) c[c$language == "Erlang",3] <- sum(missing_commits$erlang) c[c$language == "Haskell",3] <- sum(missing_commits$haskell) c[c$language == "Scala",3] <- sum(missing_commits$scala) c[c$language == "Java",3] <- sum(missing_commits$java) c %>% mutate(ratio = round(missing/(commits+missing)*100,0)) %>% arrange(desc(ratio)) %>% as.data.frame -> ratio_missing ggplot(data = ratio_missing, aes(x = reorder(language, ratio), y = ratio)) + geom_bar(stat="identity") + xlab("") + ylab("Percentage missing commits") + annotate("text", x = "Perl", y = 20, label = paste(ratio_missing[ratio_missing$language=="Perl",4], "%", sep = ""), color = "white") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + coord_cartesian(ylim=c(0, 20)) -> p #ggsave(paste(WORKING_DIR, "/Figures/ratio_missing.pdf", sep = ""), p, width=5, height=2, units="in", scale=1.5) print(p) out("PerlMissingRatio", ratio_missing[ratio_missing$language=="Perl",4])

Perl is the outlier here, then Erlang, Go, PHP, and JavaScript.

remove(WORKING_DIR)

## 6. Commits Survey (commit_survey.Rmd)

source("implementation.R") initializeEnvironment("./artifact/commit_survey")

### 6.1. Get developers' labels into one combined dataframe

library(dplyr) library("fst") library("irr") inputPath = "./input_data/commit_survey" # read all files in a directory answers = list.files(inputPath) data = lapply(answers, function(x) { cat(x, "\n") read.csv(paste(inputPath, x, sep = "/"), header = F, sep =",") }) i = 1 while (i <= length(data)) { d = data[[i]] d$category = as.character(d$V2) d$category[is.na(d$category)] = 0 d$category[d$category != 0] = 1 data[[i]]$category = as.numeric(d$category) i = i + 1 } # if this is set to T, pulls data from the extra developers we had available for the survey as well - but these finished later, or did not finish at all so we only used their subset. if (F) { inputPathExtra = "./input_data/commit_survey_extras" answersExtra = list.files(inputPathExtra) dataExtra = lapply(answersExtra, function(x) { read.csv(paste(inputPathExtra, x, sep = "/"), header = F, sep =",") }) i = 1 while (i <= length(dataExtra)) { d = dataExtra[[i]] d$category = as.character(d$V2) d$category[is.na(d$category)] = 0 d$category[d$category != 0] = 1 dataExtra[[i]]$category = as.numeric(d$category) i = i + 1 } # join the single dataframe per file into a single dataframe with all of them data = rbind(data, dataExtra) } separateDevs = data data = do.call("rbind", data) # now aggregate according to the commit #data$category = as.character(data$V2) #data$category[is.na(data$category)] = 0 #data$category[data$category == "0"] = 0 #data$category[data$category != 0] = 1 data$category = as.numeric(data$category) data$test = rep(1, nrow(data)) sanity_check = aggregate(test ~ V1, data, sum) # we expect the sanity check to be three for all commits if(!all(sanity_check$test == 3)) cat("Error -- Not all commits have 3 votes\n") dd = data data = aggregate(category ~ V1, data, sum) data$score = data$category / sanity_check$test # 3 # sanity_check$test #data$score = sapply(data$category, function(x) mean(sapply(x, function (x) if ( x == 0) 0 else 1))) data$label = ifelse(data$score>=0.5, 1, 0) # This will never == 0.5 because we will have 3 votes significant <- data[sapply(data$category, length) > 1,] discord <- significant[(significant$score != 0) & (significant$score != 1), ]

### 6.2. Compare developers' labels to study's labels

stripGHUrl = function(what) { cat(what) substr(what, nchar(what) - 39, nchar(what)) } ourLabelsPath <- "./input_data/petrs_commits" buggycommits <- read.csv(paste(ourLabelsPath, "buggy_commits.csv", sep="/"), header=F) buggycommits$V1 <- as.character(buggycommits$V1) #buggycommits$V1 = sapply(buggycommits$V1, stripGHUrl) buggycommits$ours = buggycommits$V3 buggycommits$paper = 1 nonbuggycommits <- read.csv(paste(ourLabelsPath, "non_buggy_commits.csv", sep="/"), header=F) nonbuggycommits$V1 <- as.character(nonbuggycommits$V1) #nonbuggycommits$V1 = sapply(nonbuggycommits$V1, stripGHUrl) nonbuggycommits$ours = as.numeric(!nonbuggycommits$V3) nonbuggycommits$paper = 0 allcommits = rbind(buggycommits, nonbuggycommits) data$V1 = as.character(data$V1) allcommits$V2 = as.character(allcommits$V2)

# Calculate disagreement getPaperLabels <- function() { labels = rep(0, nrow(data)) for (r in 1:nrow(data)) { df2row = allcommits[allcommits$V1 == data$V1[[r]],] labels[r] = df2row$paper; } labels } getOurLabels <- function() { labels = rep(0, nrow(data)) for (r in 1:nrow(data)) { df2row = allcommits[allcommits$V1 == data$V1[[r]],] labels[r] = df2row$ours; } labels } data$ours = getOurLabels() data$paper = getPaperLabels() data$agreePaper = as.numeric(data$label == data$paper) data$agreeOurs = as.numeric(data$label == data$ours) #data$disagreeOurs = getDisagreementLabels(data, allcommits) #data$disagreePaper = getDisagreementLabelsPaper(data, allcommits) write.csv(data, file=paste0(WORKING_DIR, "/Data/commit_survey_results.csv"))

falsePositives <- nrow(data[data$paper == 1 & data$label == 0,]) falseNegatives <- nrow(data[data$paper == 0 & data$label == 1,]) out("commitsFalsePositives", paste0(round(falsePositives / nrow(buggycommits) * 100, 1),"\\%")) out("commitsFalseNegatives", paste0(round(falseNegatives / nrow(nonbuggycommits) * 100, 1), "\\%")) #cat(paste("Percentage of false positives (Paper vs Devs): ", falsePositives / nrow(buggycommits), "\n")) #cat(paste("Percentage of false negatives: (Paper vs Devs)", falseNegatives / nrow(nonbuggycommits), "\n\n"))

unanimous_falsePositives <- nrow(data[data$paper == 1 & data$score == 0,]) out("commitsUnanimousFalsePositives", paste0(round(unanimous_falsePositives / falsePositives * 100, 1), "\\%")) #cat("Percentage of false positives upon which developers unanimously agreed: ", unanimous_falsePositives / falsePositives)

PfalsePositives <- nrow(data[data$paper == 1 & data$ours == 0,]) PfalseNegatives <- nrow(data[data$paper == 0 & data$ours == 1,]) cat(paste("Percentage of false positives (Paper vs Us): ", PfalsePositives / nrow(buggycommits), "\n")) cat(paste("Percentage of false negatives (Paper vs Us): ", PfalseNegatives / nrow(nonbuggycommits), "\n\n"))

PDfalsePositives <- nrow(data[data$label == 1 & data$ours == 0,]) PDfalseNegatives <- nrow(data[data$label == 0 & data$ours == 1,]) cat(paste("Percentage of false positives (Devs vs Us): ", PDfalsePositives / nrow(data[data$label == 1,]), "\n")) cat(paste("Percentage of false negatives (Devs vs Us): ", PDfalseNegatives / nrow(data[data$label == 0,]), "\n"))

Let's look at how certain people are:

cat(paste("3 votes not a bug: ", length(data$score[data$score == 0]), "\n")) cat(paste("Likely not a bug (1 vote for bug): ", length(data$score[data$score == 1/3]), "\n")) cat(paste("Likely a bug (2 votes for bug): ", length(data$score[data$score == 2/3]), "\n")) cat(paste("3 votes for a bug: ", length(data$score[data$score == 1]), "\n"))

This is cool, but let's do this for buggy and non-buggy commits as determined by the paper:

buggy = data[data$paper == 1,] cat("Paper labelled as buggy\n") cat(paste("3 votes not a bug: ", length(buggy$score[buggy$score == 0]), "\n")) cat(paste("Likely not a bug (1 vote for bug): ", length(buggy$score[buggy$score == 1/3]), "\n")) cat(paste("Likely a bug (2 votes for bug): ", length(buggy$score[buggy$score == 2/3]), "\n")) cat(paste("3 votes for a bug: ", length(buggy$score[buggy$score == 1]), "\n")) cat("Paper labelled as non-buggy\n") nonbuggy = data[data$paper == 0,] cat(paste("3 votes not a bug: ", length(nonbuggy$score[nonbuggy$score == 0]), "\n")) cat(paste("Likely not a bug (1 vote for bug): ", length(nonbuggy$score[nonbuggy$score == 1/3]), "\n")) cat(paste("Likely a bug (2 votes for bug): ", length(nonbuggy$score[nonbuggy$score == 2/3]), "\n")) cat(paste("3 votes for a bug: ", length(nonbuggy$score[nonbuggy$score == 1]), "\n"))

And now with me

buggyp = data[data$ours == 1,] cat("Peta labelled as buggy\n") cat(paste("3 votes not a bug: ", length(buggyp$score[buggyp$score == 0]), "\n")) cat(paste("Likely not a bug (1 vote for bug): ", length(buggyp$score[buggyp$score == 1/3]), "\n")) cat(paste("Likely a bug (2 votes for bug): ", length(buggyp$score[buggyp$score == 2/3]), "\n")) cat(paste("3 votes for a bug: ", length(buggyp$score[buggyp$score == 1]), "\n")) cat("Peta labelled as non-buggy\n") nonbuggyp = data[data$ours == 0,] cat(paste("3 votes not a bug: ", length(nonbuggyp$score[nonbuggyp$score == 0]), "\n")) cat(paste("Likely not a bug (1 vote for bug): ", length(nonbuggyp$score[nonbuggyp$score == 1/3]), "\n")) cat(paste("Likely a bug (2 votes for bug): ", length(nonbuggyp$score[nonbuggyp$score == 2/3]), "\n")) cat(paste("3 votes for a bug: ", length(nonbuggyp$score[nonbuggyp$score == 1]), "\n"))

Now look at the developers and calculate their ratio of bugs found:

for (d in separateDevs) { cat(sum(as.numeric(d$category)) / length(d$category), "\n") }

Ok, what we can do, is to compute for each developer how often he is in opposition to the rest:

devOps <- function(index) { score = data$score names(score) = data$V1 d = separateDevs[[index]] dc = d$category names(dc) = d$V1 bugX = 0 nonBugX = 0 resBug = 0 resNonBug = 0 for (i in d$V1) { s = score[[i]] if (dc[[i]] == 0) { if (s == 2/3) nonBugX = nonBugX + 1 else if (s == 1/3) resNonBug = resNonBug + 1 } else { if (s == 1/3) bugX = bugX + 1 else if (s == 2/3) resBug = resBug + 1 } } c(bugX, nonBugX, resBug, resNonBug) / length(d$V1) } for (i in 1:10) print(devOps(i))

### 6.3. Calculating Cohen's Kappa

# the number of people in the survey numDevs = length(separateDevs) # now calculate a dataframe where we have aggregated commits and for each commit we have the category assigned by all of the developers (NA if the developer did not see the commit) ratings = NULL tmp = list() for (n in (1:numDevs)) { x = cbind(as.character(separateDevs[[n]]$V1), separateDevs[[n]]$category) colnames(x) = c("url", paste0("x",n)) if (n == 1) { ratings = x } else { ratings = merge(ratings, x, by = c("url"), all = T) } } # create matrix of only the ratings w/o the commit urls #ratings_matrix = as.matrix(subset(ratings, select = - c(url)))

Now that we have the matrix, we should calculate the Cohen's Kappa to determine the interrater agreement between the raters. Because Cohen's Kappe is only defined for 2 raters, we use the Light's Kappa, which is an average over all pairwise Cohen Kappas. The fact that different commits were reviewed by different reviewers further complicates this. We therefore create a submatrix for each 2 pairs of developers, remove any NA rows, calculate the kappas and then report the results and distribution:

kvalue = c() kpvalue = c() ksize = c() kfirst = c() ksecond = c() n = 1 for (first in (1:(numDevs - 1))) { for (second in ((first + 1):numDevs)) { tmp = cbind(ratings[colnames(ratings)[[first + 1]]], ratings[colnames(ratings)[[second + 1]]]) tmp = na.omit(tmp) if (nrow(tmp) != 0) { k = kappa2(tmp) kvalue[[n]] = k$value kpvalue[[n]] = k$p.value ksize[[n]] = nrow(tmp) kfirst[[n]] = first ksecond[[n]] = second n = n + 1 } } } summary(kvalue) summary(kpvalue) summary(ksize) out("commitsKappaMedian", round(median(kvalue), 3)) out("commitsKappaMin", round(min(kvalue), 3)) out("commitsKappaMax", round(max(kvalue), 3)) out("commitsKappaPValMedian", round(median(kpvalue), 3)) out("commitsKappaPValThird", round(quantile(kpvalue, 0.75), 3))

Ok, all of our kappas are positive. Most of the p-values are also very small.

boxplot(kvalue) boxplot(kpvalue) sort(kpvalue)

## 7. Threats To Validity (threats_to_validity.Rmd)

# load the file containing the actual implementation details knitr::opts_chunk$set(echo = FALSE) source("implementation.R") initializeEnvironment("./artifact/threats-to-validity")

### 7.1. Tests in the files

data = read.csv("./artifact/repetition/Data/newSha.csv") files = loadEverything()

Let's see how many of the files are tests. First clear the files dataset to only contain commits that are in the `data`

we analyze:

files = files %>% filter(sha %in% unique(data$sha))

files %>% filter(str_detect(file_name,"Test|test")) -> tests out("ratioTestsFiles",round(nrow(tests)/nrow(files)*100,1)) files %>% filter(!str_detect(file_name,"Test|test")) -> notests out("testFilesCommitted", nrow(tests)) out("ratioTestFilesCommittedOverAll", round(nrow(tests)/nrow(files)*100,1))