I was looking for an interesting hobby project to hone my AVR assembly skills when I found the Arduboy, a sturdy “Arduino” I could keep in my pocket.

There were a lot of fun games already and even some in 3D but no filled-polygon games? This had to be rectified.

First, what do we have to work with?

The Arduboy

The Arduboy is a portable game console compatible with the Arduino IDE and uses the same microcontroller as the Arduino Leonardo, an ATmega32u4 but unlike the Leonardo board it comes with hardware pre-attached all in a nice sturdy casing.

The ATmega32u4 contains an 8 bits CPU, a tiny bit of RAM, some flash ROM for the program and data, a tiny EEPROM for saving data when the power goes out, and features a bunch of GPIO pins to connect to external hardware.

CPU8bits ATmega32u4
1KB taken by the display buffer
1.5KB left to work with
Flash ROM28KB For the application
4KB For the boot loader
32KB Total
Screen128×64 1bpp monochrome OLED
ButtonsD-pad and 2 buttons
Sound1 bit x 2 channels (or 3 levels.)
SpeakerPiezoelectric speaker
Software Controllable LEDs1 RGB, 1 Green, 1 Yellow
External IO Ports1 micro USB port. Device mode only.

The display is 128×64 1bpp which means that a full frame takes away 1KB of our memory. Unless we’re generating the display on the fly (which isn’t practical for a polygon-based 3D game, I tried. More on that later) we’re left with 1.5KB of RAM for all our working data including the CPU stack.

It is connect to the MCU via one-way SPI and we don’t get the v-sync signal from the OLED controller as that particular display manufacturer did not expose the signal on the ribbon cable. <jackie_chan_wtf.jpg>

There is no audio generation circuitry. The sound has to be generated in software through an IRQ for each PWM pulse. Eating away at our precious CPU time.

The early tests

In late May 2018 when I got my unit I started with a filled-polygon rasterizer to see if it’d be possible to get anything at a playable speed.

Then I started experimenting with different game concepts. To see which styles would fit the OLED display better, the engine’s performance, ROM space, limited controls, etc.

I settled for a concept based on Nintendo’s Star Fox on the SNES.

16 bits8 bits
3 CPUs (65816, SuperFX, SPC700)1 CPU
ROM: 1024KBROM: 28KB
Main RAM: 128KB
Video RAM: 64KB
Audio RAM: 64KB
SuperFX RAM: 256KB
Total RAM: 512KB
10Mhz (SuperFX)16Mhz (Ha! Take that!)

Saving time and preparing for the inevitable debugging ahead

Debugging on MCU hardware is never as convenient as on your favorite workstation in your favorite IDE with all the nice tools like boundary checkers and log files.

So I moved the project out of the Arduino IDE and into my own Makefile-based build system with a native PC port so I could do as much testing on PC without having to program the flash every time.

It also lets me do some convenient time-saving things like showing the bounding boxes and bezier paths in a level preview window and live-editing the levels and sounds.

Starduino – Debugging on PC

If you’re planning on spending months of your spare time working on a game invest time in making your life easier first. Doesn’t take long to pay off.

Throw out what you don’t need

Moving out of the Arduino IDE let me throw out the USB library and a few others pieces. But that meant I didn’t have access to the USB serial port for debugging…

Remember that part about having the engine & game running on PC too?

I’ll live.

The OLED Display

By default the display controller is in “horizontal mode” which is fine for 2D blitting but isn’t convenient for rasterizing polygon.

Luckily the controller has a “Vertical Addressing Mode” which turns the memory mapping into a familiar and rasterizing-friendly 64×128 screen (sideways) of 8×128 bytes starting at the top left, going right, with the lsb on the leftmost pixel (0x01 = leftmost, 0x80 = rightmost).

The initialisation commands used:

static const uint8_t init_cmds[] PROGMEM = {
0xA8, 64-1,
0xD3, 0x00,
0x8D, 0x14,
0xDA, 0x12,
0x81, 0xFF,
0xd9, 0x22,
0xDB, 0x00,
0x20, 0x01,

The great thing about making a 3D engine is you don’t really care which way the screen is. You can just fix this in the projection matrix.


I’m not going to get into the details of rasterizing polygons as there’s already plenty of books and tutorials about this both online and in dead tree format. Pick your favorite.

As long as texturing isn’t involved the CPU can handle it just fine. With 32 registers it’s very apt at this job.

Sky Rendering

However I used a little trick for the sky box. It’s actually a dynamically generated 2×96 texture drawn over the screen in a skewed fashion according to the camera’s angle by projecting the horizon onto the edges of the screen to figure out the height and slope.

The lines representing the ground are drawn onto the background pattern before clearing the frame buffer with it. Their position is calculated according to the projection matrix.

Let’s butcher some 3D models

ROM space is at a premium here so obviously data needs to be compressed somehow. We also don’t have much RAM to work with so everything will have to be decompressed (or generated!) on the fly.

And a balance has to be struck between the saved data and the decompressor code. Each instruction on the CPU is 2 bytes and this can add up quickly.

I didn’t want to model 3D objects by entering vertex values by hand so I made an OBJ to C++ converter-compressor tool and modelled the objects in Modo.

The first obvious place to save bytes is to mirror your models whenever possible so the compressor looks for matching mirrored polygons on all 3 dimensions (X,Y,Z) and merge them into mirrored batches.

This means all I need is 3 bits to tell the decompressor if it has to repeat the poly batch mirrored on X, Y, and/or Z.

This all fitted nicely into a 16bit batch header VVZYXIIIIICCCCCC where

  • V is the number of vertices per poly-1 (dot, line, tri, quad),
  • ZYX are the mirror bits,
  • I is the index offset of vertex 0 in the batch,
  • C is the number of primitives to render in the batch.

The engine uses 32bits signed fixed point math in a 16.16 format (16 bits of fraction). This meant 12 bytes per vertex but this is much too wasteful.

That’s when floating points become your friend.

But wait! Floats are 32bits! That’s still 12 bytes!

Yes, standard floats are but I’m not using that.

The trick I used was to encode the vertices as 4bits of significand and a shared 4 bits of exponent, for a total of 2 bytes per vertex.

Modern GPUs use a similar trick to encode HDR colors such as RGB9_E5_EXT in OpenGL in only 32bits.

void VectorUnpack(Fixed *out_v, const uint16_t *v)
uint16_t t = pgm_read_word_near(v);
byte e = (t & 0xF) + 8;
int32_t *dst = (int32_t *)out_v;
for(byte i=3; i--;){
int8_t tx = (t & 0xF0);
tx >>= 4;
*dst++ = ((int32_t)tx) << e;
t >>= 4;

This is where the 3D models get butchered by the compressor tool but with careful adjustment of the vertices I could get the models to look fine even after rounding. Beggars can’t be choosers.

Each poly has a 4bit “color” index into a 16 entries palette that maps to 9 different dithering patterns. The indirection allows me to create the pulsing brightness animation and blinking.


Models Animated

Usually animations in 3D engines are done using bone matrix transforms with complex bezier curves but this was going to be much too large for this project and the poor little CPU could not have dealt with multiple bone matrices in any reasonable time frame.

So animations were done using simple vertex interpolation instead. When vertices are only 2 bytes (and mirrored!) it becomes more reasonable for the few meshes that would get animated.

Bounding boxes and collisions

Ugh! more bytes, right?

Nope. Not even going to store any bounding boxes.

All the objects have a scaling value to reuse the meshes and the bounding box was implicitly 1x1x1 centered around the object’s origin. (-0.5 to 0.5) then scaled by the instance’s scaling.

Complex collisions were created using invisible objects.

Extra invisible collision objects

You’d think the extra invisible objects would add more bytes than storing bounding boxes, however the extra code complexity necessary to handle multiple bounding boxes of varying sizes and positions per objects out-weights the savings.

The level

The level was encoded using spawning sections that are indexed to be reused.

Level sections in red-green pattern

The game contains a single intro level where elements are gradually introduced to let new players learn the controls.

After which a random level generator picks from 3 pools of level segments:

  • Dangers
  • Breaks
  • Energy Refills

When the game starts only the first few segments of each category can be selected by the RNG.

As the score goes up the game:

  • Accelerates from 1x at 0 points to 2x at 128K points.
  • The more dangerous segments from each pool get enabled.
  • More dangers get selected before breaks are picked.
  • More breaks get selected before an energy refill gets picked.

The random number generator is seeded with a constant so the same level gets generated every time to keep the high scores fair.

Text rendering

The font is encoded as 5×3 pixels per glyph into a 16bit value.

The glyph is first unpacked into an 8×5 buffer (remember the screen is on its side).

A style buffer tells the text rendering routine how to render the multiple text passes:

  • Expand Border
  • Fill 8×5 Cell
  • NAND Mask Draw (black)
  • XOR Draw (invert)

Drawing white text is a mask to set to black followed by an invert.

It seems wasteful but the black outline step takes care of clearing to black so the 2nd pass only needs to invert.

This allows various styles using the same text routine and saves on code space.

Reducing the lag

To help controlling the game when the frame rate dips the inputs are read 60 times per second (same rate as the game play logic) and stored into a circular buffer so inputs are processed at their input time.

Internally the game simulation runs at 60 ticks per second regardless of frame rate.

The killer math

Doing 32bit fixed point 3D math really kills that poor 8bit CPU.

Around 50% of the time is spent doing the geometry transforms, another 30% for the clipping.

I had to write the 16.16 x 16.16 -> 16.16 fixed point multiplication code in assembly to get a decent frame rate.

The main advantage other than not doing the ridiculously overkill 32 x 32 -> 64 fixed point math in C++ was that I was able to skip some of the long-multiplication steps as 3D vertex*matrix transform has a lot of zero bytes the way the models are rounded off so the assembly code skips an entire (byte * 4 bytes) multiply-accumulate chunk of code for each zero bytes.

Step 1 and Step 4 (not pictured) can be entirely skipped in this example.

The fastest routine is the one that does nothing. This was the biggest gain in speed.

The tally

Data3220 Bytes
Code25232 Bytes
Frame buffer1024 Bytes
Work memory1314 Bytes
Stack left222 Bytes

Download the .arduboy file (98KB Turbo version) (Classic version: .arduboy file (100KB) ) to flash the game on your Arduboy.

Or the full .zip file with the printable PDF instruction booklet (600KB) and contains the .hex file to upload using avrdude or your method of choice.

Direct link to the .hex file (Turbo Version with 50% higher FPS) 

Direct link to the .hex file (Classic Version)

Or if you need only the instruction booklet (516KB).

If you’re gonna spoof something, go all the way 😀

Leave a Reply

Your email address will not be published. Required fields are marked *