Hello World takes 30 lines of assembly?
Recently, I saw a programming meme on the BookFace so-called Social Media platform. It read "Ok ima Learn Assembly [sic] Damn Hello World is 30 lines", accompanied by a picture of a boxer first ready to fight, and then the same boxer taking a break with a water bottle.
Cards on the table, I am not a professional in any assembly language. I would say that I am most knowledgeable in Z80, and know enough in 6502 and MIPS to get by. And because I've not done any Z80 for a while (nor any other assembly for that matter), I'm a bit rusty with it. But I couldn't think how in Z80, or in 6502, or in MIPS, a Hello World program would take 30 lines of assembly. It may take 30 bytes, or words, but that's not the same. If you were learning assembly, you wouldn't be entering a program byte by byte (or word by word). That wasn't even a good idea 35 years ago.
So sure was I about this that I posted an example in 6502 with the target platform being the Commodore C64. My first try, although flawed, was 7 lines of assembly and 12 bytes of data. Whilst it worked fine to output 'HELLO WORLD' to the C64 screen (at 1024), it had a flaw that I didn't realise (as I didn't test the code before I posted it). It was only after testing that I realised the mistake, and posted a follow up which corrected my initial bug.
Some notes before I continue. For this example, I am using an online assembler found at nurpax.github.io/c64jasm-browser but other assemblers are available; if you want to try these examples on real hardware then I would recommend Turbo Macro Pro. Some examples of how to use this are shown on Robin Harbron's excellent 8 Bit Show and Tell YouTube channel here. I recommend Robin's tutorials as he goes into way more depth than I will here.
Let's start from the dumbest example and work through it. In doing so, we may discover the elusive 30 lines of code issue highlighted by the meme.
* = $c000 ; Start address, call with
; SYS 49152 from C64 BASIC
lda #$48 ; 'H'
jsr $ffd2 ; Kernal CHROUT call
lda #$45 ; 'E'
jsr $ffd2
lda #$4c ; 'L'
jsr $ffd2
lda #$4c ; 'L'
jsr $ffd2
lda #$4f ; 'O'
jsr $ffd2
lda #$20 ; ' '
jsr $ffd2
lda #$57 ; 'W'
jsr $ffd2
lda #$4f ; 'O'
jsr $ffd2
lda #$52 ; 'R'
jsr $ffd2
lda #$4c ; 'L'
jsr $ffd2
lda #$44 ; 'D'
jsr $ffd2
rts
As already stated, this is a dumb example, and we get to 24 lines.
It uses the C64 Kernal to output each character to the screen. This has some advantages: when you call it with SYS 49152 it outputs from the current cursor position, and the rts instruction returns you cleanly to BASIC. But it isn't necessarily the best way to do things. As an aside, this will also work on other Commodore machines, certainly the VIC-20 and C128 in native mode, and probably all other Commodore 8-bits including the PET, as long as you relocate the code to somewhere with free memory available to store it.
Using the Kernal may be slower than handling things yourself, and using the CHROUT in particular means that you cannot easily write to the whole screen, as when you get to the last screen position (on the C64 this is 2023, or $07e7 in hexadecimal) and output there with a jsr $ffd2, your screen will scroll either one or two lines, and therefore the top one or two lines will disappear. If you limited your text to 999 characters maximum, or outputted to a specific screen area then this might be useful, but you'd also have to use the Kernal to position the cursor correctly before writing to the screen, which again may be cumbersome in some instances.
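To make the screen arithmetic above concrete, here is a minimal sketch in Python (used purely for illustration, it is not part of the C64 example). It assumes the default 40 column by 25 row text screen with screen RAM at 1024 ($0400); the function name screen_address is hypothetical.

```python
# Assumption: default C64 text screen, 40 columns x 25 rows,
# with screen RAM starting at 1024 ($0400).
SCREEN_BASE = 0x0400
COLUMNS, ROWS = 40, 25

def screen_address(column, row):
    """Address of a character cell in screen RAM (column and row are zero-based)."""
    return SCREEN_BASE + row * COLUMNS + column

# The bottom-right cell is the one that triggers a scroll when CHROUT writes to it.
last_cell = screen_address(COLUMNS - 1, ROWS - 1)
print(last_cell, hex(last_cell))  # 2023 0x7e7
```

That leaves 999 cells before the final position, which is where the 999 character limit comes from.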
A more pertinent point here is that if you were actually going to learn assembly, you would not write any assembly like this. Aside from the relative slowness of calling the Kernal CHROUT, you also have repeating code as almost every other line is a jsr $ffd2, and one of the lda instructions is not necessary at all. The above example loads the 'L' character in HELLO twice, calling the CHROUT routine after each load; a small efficiency here is to simply load the L character value once and call the CHROUT routine twice (this works because CHROUT preserves the Accumulator). But even then, doing things this way isn't how you would learn assembly. In assembly, you would store each byte of your message (HELLO WORLD) in an area of memory, and set up a conditional loop that reads a byte, outputs it to your display, and repeats until your terminating condition is met. Based on the CHROUT example above, let's have a look at how one might do this.
* = $c000
lda #$00 ; Set Accumulator to zero
ldx #$00 ; Set the X register to zero
lda $c01b,x ; Start of loop: read the
; Xth byte of the data
cmp #$00 ; Have we hit our terminator yet?
beq *+9 ; If so, branch ahead 9 bytes
jsr $ffd2 ; CHROUT
inx ; Increment the X index
jmp $c004 ; Jump back to the start of our loop
rts ; Return to BASIC
* = $c01b ; Data at $c01b
; The data is split into multiple lines as
; putting it on a single line does not work
; well on the Blogger platform as the text
; overflows the design boundary and looks ugly
!byte $48, $45, $4c, $4c
!byte $4f, $20, $57, $4f
!byte $52, $4c, $44, 0
Now that looks better, and we're definitely not near the 30 lines supposed in the meme, more like 12 if the data bytes are written on a single line.
A cautionary note here. Whilst I am not using labels, they are very useful: they make your development easier to manage and give any assembly program room to grow, as the labels will move as your code gets longer or shorter. I don't use them as the online assembler linked above gives me an instant disassembly of the program as I type it, so I am able to correct my assembly code as necessary. But for convenience, I have set the start of the code to $c000 (SYS 49152 as already mentioned), and the data to be stored from $c01b.
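Without labels, the addresses in the listing have to be worked out by hand from the size of each instruction. As a rough check, here is a minimal Python sketch (Python is only used for the sums, and the function name assign_addresses is hypothetical). It assumes the standard 6502 instruction sizes for the addressing modes used, and that `*` in `beq *+9` means the address of the beq instruction itself.

```python
# Instruction sizes in bytes for the 6502 addressing modes used in the listing.
listing = [
    ("lda #$00",    2),  # immediate
    ("ldx #$00",    2),  # immediate
    ("lda $c01b,x", 3),  # absolute,X
    ("cmp #$00",    2),  # immediate
    ("beq *+9",     2),  # relative
    ("jsr $ffd2",   3),  # absolute
    ("inx",         1),  # implied
    ("jmp $c004",   3),  # absolute
    ("rts",         1),  # implied
]

def assign_addresses(instructions, origin=0xC000):
    """Return {instruction text: start address}, walking forward from origin."""
    addresses = {}
    pc = origin
    for text, size in instructions:
        addresses[text] = pc
        pc += size
    return addresses

addr = assign_addresses(listing)
for text, _ in listing:
    print(f"${addr[text]:04x}  {text}")
```

The loop starts at $c004, which is where the jmp points, and the beq at $c009 plus 9 gives $c012, which is the rts. A label-aware assembler does this bookkeeping for you.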
We have a much nicer example now, but there are still some things to improve: firstly, we have at least one line of unnecessary assembly: cmp #$00. This compares an immediate value (zero) with the current value in the Accumulator (A). Before this comparison we have loaded a value into A with lda $c01b,x. This takes a byte from memory location $c01b offset by the current value of X. On our first iteration, this takes in the value $48, as the X register is zero, and therefore so is the offset. Our comparison is false, and the beq *+9 (branch if equal, 9 bytes ahead) does not happen. We then output the character with jsr $ffd2, increment the X register with inx and jump back to the start of the loop with jmp $c004. And so the next iteration will load A from memory location $c01b offset by 1, and so on until X is 11 and our zero terminator condition is met. And once the condition is met, it branches 9 bytes ahead to the rts instruction, returning back to BASIC.
You may add more PETSCII bytes to the data block from $c01b as long as you don't have more than 255 bytes of data in the block including the zero terminator. For reference, the C64 character codes are here.
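If you don't fancy looking up each code by hand, the data block can be generated. This is a minimal Python sketch, not part of the original example, and the helper names petscii_bytes and byte_lines are hypothetical. It assumes the message only uses upper case A to Z, digits and spaces, where the PETSCII codes match ASCII; anything else is rejected rather than guessed at.

```python
def petscii_bytes(message):
    """Encode a message of A-Z, 0-9 and spaces, then append the zero terminator."""
    allowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
    for ch in message:
        if ch not in allowed:
            raise ValueError(f"unsupported character: {ch!r}")
    return [ord(ch) for ch in message] + [0]

def byte_lines(data, per_line=4):
    """Format bytes as !byte directives, a few values per line."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = ", ".join(f"${b:02x}" for b in data[i:i + per_line])
        lines.append(f"!byte {chunk}")
    return lines

data = petscii_bytes("HELLO WORLD")
print("\n".join(byte_lines(data)))
```

For HELLO WORLD this gives the same 12 bytes as the listing, with the terminator printed as $00 rather than 0.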
We are explicitly comparing the current memory contents to a zero value; in 6502 we don't need to do this. This is because loading a value into the Accumulator, either directly or by reading a memory address, sets the zero flag if the value loaded is zero (and clears it otherwise). As beq (branch if equal) simply tests the zero flag, we may do a beq *+9 without the preceding cmp #$00. If our terminator had a value of 255 ($ff in hexadecimal), we would need a cmp #$ff before the branch instruction. By using zero as the terminator, we have saved one line of assembly, and two bytes.
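To see why the explicit comparison is redundant, here is a minimal Python model of the loop (a sketch for illustration only, not an emulator; run_loop is a hypothetical name). It assumes the only thing that matters is that lda sets the zero flag from the loaded value, and it stands in for CHROUT by collecting characters in a list.

```python
DATA = [0x48, 0x45, 0x4C, 0x4C, 0x4F, 0x20, 0x57, 0x4F, 0x52, 0x4C, 0x44, 0x00]

def run_loop(data):
    """Model of: ldx #0 / lda data,x / beq done / jsr CHROUT / inx / jmp loop."""
    output = []
    x = 0
    while True:
        a = data[x]               # lda data,x
        zero_flag = (a == 0)      # lda sets the zero flag from the loaded value
        if zero_flag:             # beq is taken only when the zero flag is set
            break
        output.append(chr(a))     # stand-in for jsr $ffd2 (CHROUT)
        x = (x + 1) & 0xFF        # inx wraps around at 8 bits
    return "".join(output), x

text, final_x = run_loop(DATA)
print(text, final_x)  # HELLO WORLD 11
```

No comparison against zero appears anywhere in the model: the load itself tells us whether the terminator has been reached.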
This isn't the only improvement that we may make here; at the start of our example above, we load the Accumulator with an immediate value of zero (lda #$00), and then do the same with the X register. But to save another two bytes, we don't need to initialise A to zero; we only need to initialise the X register to zero as that is being used as our offset. As we have saved two bytes, the jmp $c004 needs repointing: the start of the loop is now two bytes lower in memory, at $c002. With these improvements, let's have a look at our new assembly listing.
* = $c000
ldx #$00
lda $c01b,x
beq *+9
jsr $ffd2
inx
jmp $c002
rts
* = $c01b
!byte $48, $45, $4c, $4c
!byte $4f, $20, $57, $4f
!byte $52, $4c, $44, 0
Surely now we're done. The code itself is now pretty efficient. But we can still save one byte in our main loop. Remember that the zero flag is set if you load something into the Accumulator that equates to zero? The same principle applies to the X and Y registers. After we call the CHROUT routine with jsr $ffd2, we then increment the value in X with inx. The first time this happens, X will of course hold a value of 1, and we know that X will never wrap around to zero as we only have 12 data bytes including the terminator. This means that the zero flag is never set by inx in our example, so we may use a branch if not equal (bne) instruction, which is always taken here, instead of jmp $c002. A branch is two bytes where a jmp is three, which saves one more byte from our code! The catch is that a branch has a limited range: its offset is a signed byte, so it can only reach from 128 bytes back to 127 bytes forward, counted from the instruction after the branch. Our code is small, and the loop beginning at $c002 is only 9 bytes back from the bne that we're adding. As the rts line, which gets us back to BASIC, is now one byte lower in memory, the beq *+9 needs repointing too. Let's have a look at our final version with this optimisation:
* = $c000
ldx #$00
lda $c01b,x
beq *+8
jsr $ffd2
inx
bne *-9 ; We know that the zero flag
; is not set as we have incremented
; the X register by 1 and also we
; are not reading in 256 bytes of data.
; When the X register is 255 ($ff) and
; is incremented, it wraps around to
; zero. The same rule applies to
; the Y register.
rts
* = $c01b
!byte $48, $45, $4c, $4c
!byte $4f, $20, $57, $4f
!byte $52, $4c, $44, 0
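For the curious, the two branches in this listing can be checked with a little arithmetic. This is a minimal Python sketch (for illustration only; branch_offset is a hypothetical helper). It assumes the beq sits at $c005 and the bne at $c00b, which I have worked out by hand from the instruction sizes, and that `*` is the address of the branch instruction itself. On the 6502 the operand of a branch is a signed byte counted from the instruction after the branch.

```python
def branch_offset(branch_address, target_address):
    """Operand byte for a 6502 relative branch.

    The offset is counted from the instruction after the two-byte branch
    and must fit in a signed 8-bit value (-128..127).
    """
    offset = target_address - (branch_address + 2)
    if not -128 <= offset <= 127:
        raise ValueError(f"branch out of range: {offset}")
    return offset & 0xFF  # two's complement encoding of the signed offset

print(hex(branch_offset(0xC005, 0xC00D)))  # beq *+8 -> 0x6
print(hex(branch_offset(0xC00B, 0xC002)))  # bne *-9 -> 0xf5
```

So `*+8` and `*-9` in the source become offsets of +6 and -11 in the assembled bytes, because the processor counts from the end of the branch rather than its start.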
So there you go, we have a small and efficient hello world example in assembly, starting from a dumb example. I was going to go on to write this to screen RAM at $0400 hexadecimal, or 1024 decimal, but this blog post is already long enough just covering the points that I wanted to, so I'll leave writing directly to the screen RAM for my next blogger update.
Thanks to Robin of 8 Bit Show and Tell mentioned above, and for the other feedback I got from Mastodon. I received some very helpful comments about this blog post, which have improved it. Note that we are able to make small and tight loops like this on the C64 with the CHROUT Kernal routine because this call preserves the Accumulator, X and Y values for us. Other Kernal calls may not do this, and will therefore require additional logic to track these values. But this is for a future blog. I've enjoyed re-acquainting myself with some 6502 again, and I'm glad it makes more sense now than it did the last time I looked at it.
If you don't want to wait for the next update, a really good C64 resource is available here, and there are plenty of 6502 resources just a few clicks away.