Translating MATLAB software to R

my experience with rBAPS

Waldir Leoncio

Oslo Centre for Biostatistics and Epidemiology, UiO

2024-05-30

ExcusesDisclaimers

  1. This project started in 2019, a.k.a. the year 3 BC copilot
  2. Total project time, so far: 450 hours
  3. I was raised by (programming) wolves
    • BS + PhD in Statistics (compute-heavy thesis)
    • Statistician + affinity for computers = sloppy programmer 😳
    • Knowledge of science + knowledge of software development = great RSE! 😎

The job

Translate core functionality of BAPS from MATLAB to R

  • Initially, 5 273 of 42 461 LOC (i.e., 4 modules)
  • Replace/incorporate some related features

BAPS

  • Bayesian Analysis of (genetic) Population Structure
  • About a dozen methodological papers for BAPS (early 2000s)

OG team

The foreseen challenges

Language issues

  1. I was raised as a 100% R guy (i.e., I’d only heard of MATLAB)
  1. Some of the codebase looks like this:
global POP_LOGML

if ready == 0
  if vaihe == 1
    roundTypes = [1];
  elseif vaihe == 2
    roundTypes = [2 1];
  elseif vaihe == 3
    roundTypes = [5 5 7];
  elseif vaihe == 4
    roundTypes = [4 3 1];
  elseif vaihe == 5
    roundTypes = [6 7 2 3 4 1];
  end
end

% TALLENNETAAN
npops = poistaTyhjatPopulaatiot(npops);

The legacy

  • ☹️ Original dev team (5 PhD students) had long disbanded
  • 😀 Methodological support from Jukka and XP from Gerry were priceless
  • 😭 BAPS 6 (latest version) has no CLI, so debugging = lots of clicking around

The plan

  1. Run the original BAPS to get acquainted
  2. Translate software to R, acquire XP on the go
  3. Publish when it’s ready (“5k LOC? Should take a couple of years, LOL”)

The unforeseen challenges

(buckle up, kiddos)

Getting the original BAPS to run

  • Can’t run the old binaries
  • Can’t recompile the source code (skill issue)
    • Eventually got the original working in Jan 2024
    • Then the official demo wouldn’t run anymore
  • Until now, much manual running of single MATLAB functions (bottom-up)

Top-down vs. bottom-up programming

Go from top-level functions down the rabbit hole or the other way around?

Aspect ⬇ Top-down ⬆ Bottom-up
Chance for errors Higher Lower
Chance to waste coding Lower Higher
Initial speed Fast Slow
  1. Started bottom-up
  2. Got annoyed at 🐢 development speed, switched
  3. Things broke spectacularly, switched back.

Result: 4 years of “one step forward, one step back”

Coming to grips with the MATLAB IDE

<rant><matlab>

“WTF are these shortcuts? Alt+W to copy? Ctrl+Y pastes?”

FWIW this is all fixed in MATLAB Online

Things got way better using MATLAB on VSC

MATLAB is weird (for an R user)

No default values for function arguments? WTF is this nargin(), nargout()?

Output-based overloading

>> a = max(10:-1:1)

a =

    10

>> [a, b] = max(10:-1:1)

a =

    10


b =

     1

Is abc() a function call or a subset?

if testaaPop(line3)==1
  %2 rivi tällöin lokusrivi
  nloci = rivinSisaltamienMjonojenLkm(line2);
  line4 = fgetl(fid);
  if isequal(line4,-1)
      disp('Incorrect file format'); fclose(fid);
      return
  end
  if ~any(line4==',')
      % Rivin nelj?täytyy sisältää pilkku.
      disp('Incorrect file format'); fclose(fid);
      return
  end
  pointer = 1;
  while ~isequal(line4(pointer),',')  %Tiedetään, ett?pysähtyy
      pointer = pointer+1;
  end
  line4 = line4(pointer+1:end);  %pilkun jälkeinen osa
  nloci2 = rivinSisaltamienMjonojenLkm(line4);
  if (nloci2~=nloci)
      disp('Incorrect file format'); fclose(fid);
      return
  end
end

</matlab><r>

R vectors are just weird

Object
print()
dim()
1:4
[1] 1 2 3 4
NULL

R vectors are just weird

Object
print()
dim()
1:4
[1] 1 2 3 4
NULL
t(1:4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
      
[1] 1 4

R vectors are also weird

Object
print()
dim()
1:4
[1] 1 2 3 4
NULL
t(1:4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
      
[1] 1 4
t(t(1:4))
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
      
[1] 4 1

</r></rant>

LPT: looks can be deceiving

>> A = [1 2 3; 4 5 6]

A =

     1     2     3
     4     5     6

R> (A <- matrix(c(1, 4, 2, 5, 3, 6), 2))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
>> sum(A)

ans =

     5     7     9
R> sum(A)
[1] 21

This alone was the cause of multiple elusive bugs

The present and the future

After 450 hours of work:

  1. A still broken R package on GitHub: rBAPS
  2. Preservation of the original code on GitHub: BAPS
  3. Restoration of original webpage’s content on the GH Wiki
  4. A translation layer published on CRAN: matlab2r

How I’m working now:

  • (Almost) line-by-line comparison of objects
  • Adaptation of original code to run on terminal (1 software became 2 ⚰️)
  • Eventually: convert global variables to local (for CRAN) 💣

Three Lessons learned

After 4 years working as an RSE

1. Publishing papers vs. software

Understand and disseminate the differences.

Product Chances to publish Patches
Research paper “one” “embarrassing”
Research software “infinite” good
  • Make your MVP as M as possible (4 modules is 3 too many; no extras)
  • Don’t give estimates (or give it in hundreds of hours, if you must)
  • Release ASAP to avoid overhype
  • No such thing as bug-free software

2. Don’t hate the language, hate the IDE

  • Use whatever tool works best for you (even Emacs)
  • But don’t be afraid to explore new things!
    • 2005-2019: “R and RStudio are all I care about”
    • 2020-2024: “I love all languages (even MATLAB). VSC and vim are all I need!”
    • 20XX: ?

3. Future-proof your ideas: make them public

  • It’s a scientific community, after all
  • Relax, nobody’s gonna steal your idea
  • And if they do, that’s what OS licenses are for

Publishing your code actually protects it!

Commit and push timestamps your work

  • Write object names Samuel could easily pronounce
  • Everybody is embarrased of code they wrote years months ago!

Thank you for your attention!

Thank you for hosting us! ❤️ 🇫🇮

Github: wleoncio, ocbe-uio