Cycle counting on the Atari ST

Version 1.0

http://pasti.fxatari.com

This document may be freely distributed as long as it is not modified in any way.

Atari, ST, Motorola, or other names may be trademarks of their respective owners.

This article is about how to determine the exact number of clock cycles taken by a specific piece of machine language code on the Atari ST. It assumes the reader has a fair knowledge of the ST architecture and the Motorola 68000 processor. Potential readers are advanced coders, emulator authors, or anyone interested in understanding the finer details of the ST.

Introduction

Why count exact cycles at all? There are many reasons. For the coder, it is because it is fun, because it allows achieving goals that would otherwise be impossible, and also because it is possible. For the emulator author it is, obviously, to make the emulation as accurate as possible.

The 68000, unlike many later 16-bit or 32-bit processors, has easily predictable timing. This allowed ST coders to achieve some effects that were probably far beyond the capabilities imagined by the ST designers.

Perhaps the most common purpose of cycle counting on the ST is producing software overscan and the so-called “Spectrum-512” effects. Those are impossible without cycle accuracy.

First stage: Use the reference book.

The most fundamental reference for cycle counting is the official Motorola documentation. The usual name is “M68000 Programmer’s Reference Manual”. It is detailed, comprehensive, and (at the time of this writing) it is available free of charge from the manufacturer’s web site. We’ll refer to it as the PRM.

There are plenty of third-party tables, books, etc., but I find the official manual extremely well written and invaluable. Depending on the edition, it might have a few errors. And all editions leave a couple of timings undocumented. The most important omission is probably the exact timing of the division instructions. See our article at (http://pasti.fxatari.com/68kdocs/div68kCycleAccurate.c) for full, detailed coverage.

For some time the ST community believed that this manual was all that was needed, and that instructions always performed “by the book”. There was no hint whatsoever in the Atari documentation to suggest otherwise. Unfortunately it isn’t that simple.

Second stage: Round up to a multiple of four.

I don’t remember when and where I read about it for the first time. But at some point the ST community learned that there is one rule that should be followed. It is the (nowadays) well-known “round up to 4” rule. It means that all instructions on the ST take a number of clock cycles that is a multiple of four. If the “by the book” cycle count is not already a multiple of four, then it is rounded up.
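
As a minimal sketch of this rule (Python is used here purely for illustration, and the function name is hypothetical), applied to the nominal timings of three instructions that appear later in this article:

```python
def round_up_to_4(cycles: int) -> int:
    """Round a nominal ("by the book") cycle count up to a multiple of four."""
    return (cycles + 3) & ~3


# Nominal timings of three instructions used later in this article.
nominal = {"NOP": 4, "CLR.L D0": 6, "BRA.W": 10}
for name, cycles in nominal.items():
    print(f"{name}: {cycles} -> {round_up_to_4(cycles)}")
```

With these inputs the rounded values are 4, 8 and 12 cycles respectively.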

What is the reason for this rule? The reason is in the ST RAM and video architecture.

The ST doesn’t have dedicated video RAM. The programmer can locate video memory anywhere in main RAM as long as it is properly aligned. That portion of RAM, besides being written (or read) by the main processor, must also be read by the video subsystem to generate the actual video signals.

This produces a conflict. Standard RAM is not designed for having two masters. There is a special type of dual-port RAM designed for this purpose, and it is used in some video architectures. But even today it is very expensive, and it was absolutely prohibitive at the time for an architecture without dedicated Video RAM.

Some type of RAM arbitration therefore has to be designed. There are several ways to implement this arbitration. The ST implementation takes advantage of the special characteristics of the 68000 bus cycle. Bus cycles on the 68000 take a minimum of four clock cycles. But most of the time during a bus cycle is spent on control and handshake, and RAM at the time was fast enough to perform an access cycle in half that time. So the ST interleaves CPU and Video access to RAM, two clock cycles for each. This is performed by the MMU, one of the main custom chips in the ST.

In reality it is not just CPU and Video. The MMU distinguishes between two types of RAM accesses. The first is RAM access that is performed from the main ST bus. It includes standard CPU bus cycles, and DMA cycles requested by the DMA chip, Blitter or another DMA master. The second type of RAM access doesn’t reach the main bus. It includes, of course, Video access for Shifter, but also RAM refresh and in the STe, DMA sound as well. We’ll call this last type of access “internal”, and the other one “external”.

The MMU allocates two clock cycles for each type of RAM access, in a round-robin fashion: two clock cycles for “internal” access, and two for “external” access. There is no priority of any kind. Internal and external “slots” alternate constantly. Internal accesses are always generated by MMU itself, so it ensures that they always fall on an internal slot. But an access generated by an external agent, such as the CPU, might attempt to produce a “misaligned” bus access.

As we mentioned already, a standard 68000 bus cycle takes four clock cycles. But the MMU assigns specific phases of the bus cycle to actually address and access the RAM. The exact implementation is not relevant, and it’s outside the scope of this article. For the purpose of our topic, what matters is that an external bus cycle must start aligned. Aligned in such a way that the assigned phases would fall exactly on the external slot mentioned in the previous paragraph.

If a misaligned access is attempted, MMU can’t allow it to proceed and inserts wait states to force an aligned bus cycle. Note that MMU doesn’t care whether the “internal” slot is actually used or not; a misaligned access is always delayed and forced into alignment.
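
The delay can be sketched as a function of the clock cycle at which an external bus cycle tries to start. This is only an illustration. It assumes that cycle 0 coincides with the start of an external slot (an arbitrary choice) and that bus cycles are only requested on even clock cycles; the function name is hypothetical.

```python
def wait_states(start_cycle: int) -> int:
    """Clock cycles an external access has to wait before it can start.

    Assumption of this sketch: external slots begin on multiples of four.
    """
    return -start_cycle % 4


for start in (8, 10, 12, 14):
    actual = start + wait_states(start)
    print(f"bus cycle requested at {start}: starts at {actual}")
```

A request at cycle 8 or 12 starts immediately, while a request at cycle 10 or 14 is held for two cycles.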

Assume the CPU is performing a sequence of contiguous bus cycles. The first one might be misaligned and would be delayed. But all the rest of the bus cycles would be subsequently aligned and would perform at full speed. The 68000 is rather orthogonal and symmetric, most instructions take a multiple of four cycles, and they would naturally tend to perform an aligned bus access already. So the ST implementation is possibly an acceptable compromise between cost and performance.

But not all code performs aligned bus accesses. An obvious example is an instruction that takes 6 clock cycles, such as CLR.L D0. CLR.L takes 6 cycles according to the PRM, but because it leads to a misaligned access, it actually takes 8 cycles once the bus cycle is delayed and aligned.

As we’ll see in the following chapters, the last sentence is not 100% accurate. Furthermore, it is not always true. But in most cases, a CLR.L would contribute 8 clock cycles to the sequence. And this is the cause of the “round up to four” rule.

Third stage: Instructions might pair with each other.

Again I don’t know the exact dates. But for a long time, the ST community followed the above rules: “Check the nominal timing, and then round up to four”. At some point ST emulators became extremely accurate, and their authors found that something was wrong. They found out that a 6-cycle instruction does not always take 8 cycles; it depends on the code sequence. The reason is something that I call “pairing”.

As might be clear from the previous section, the key point is not precisely how many cycles an instruction takes, or whether the instruction cycles are a multiple of four. The key point is the exact location of the bus cycles in the code sequence, because what really matters is whether a bus cycle is performed on a four-cycle boundary or not. It is only the alignment of the bus cycle that determines whether MMU inserts wait states.

So we should correct the rule to a more complex one:

A bus cycle must be aligned, in relation to the previous one, at a four-cycle boundary. If it is not, it takes two more clock cycles.

In most cases the simple round-up rule gives the same result, but not always. To see why, let’s first consider the following code sequence:

NOP

CLR.L D0

NOP

If we break up the bus activity of the above sequence, we get:

Cycle 0: NOP Prefetch

Cycle 4: CLR.L Prefetch

Cycle 8: CLR.L Internal processing, bus idle.

Cycle 10: NOP Prefetch attempt, delayed by MMU

Cycle 12: NOP Prefetch performed

Cycle 16: Next instruction.

The theoretical execution time of the above sequence is 14 cycles. But because there was a misaligned bus cycle, the actual number of cycles is 16. This matches the simple rule of the previous section, which would arrive at the same total by rounding up the CLR.L execution time from 6 to 8 cycles.

One interesting point to note is that CLR.L above really takes 6 cycles, not 8. It is at the next NOP that the CPU gets wait states, so the second NOP takes 6 cycles instead of 4. For practical purposes this really doesn’t matter; you normally don’t care about how many cycles a specific instruction takes. You care about the whole sequence, and about the exact timing of the bus cycles. However, this brings up a very relevant issue.

The relevant issue here is that there is an implicit condition that makes CLR.L take 6 cycles, and not 8. The condition is exactly when the CPU performs the bus cycle in that instruction. If you look at the execution times table in the PRM, it indicates the following timing for CLR.L on a register: 6(1/0). This means one bus cycle, which we already know takes 4 clock cycles, and a total of 6 clock cycles. It is obvious that there are two clock cycles where the bus is idle, but which ones? The first two, the last two, or one at the start and one at the end?

The PRM doesn’t provide the answer. In this specific instruction the bus cycle is performed at the start, and the bus is idle at the last two cycles. But there are other instructions that take 6 cycles, where the order is reversed and the bus is idle at the start. There are no cases where the bus is idle one cycle at the start, and another single cycle at the end.
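
The number of idle cycles, though not their position, follows directly from the table notation. A small sketch, assuming the n(r/w) form of the PRM tables (total clock cycles, read bus cycles, write bus cycles) and four clock cycles per bus cycle; the helper name is hypothetical:

```python
import re


def idle_cycles(timing: str) -> int:
    """Return the clock cycles not covered by bus cycles for an 'n(r/w)' entry."""
    match = re.fullmatch(r"(\d+)\((\d+)/(\d+)\)", timing)
    if match is None:
        raise ValueError(f"unrecognized timing entry: {timing!r}")
    total, reads, writes = (int(group) for group in match.groups())
    return total - 4 * (reads + writes)


print(idle_cycles("6(1/0)"))  # CLR.L on a data register
print(idle_cycles("4(1/0)"))  # NOP
```

For 6(1/0) this gives two idle cycles, and for 4(1/0) none. Where those idle cycles sit inside the instruction still has to come from another source.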

In the first code sequence above, it wouldn’t matter much whether the bus were idle at the start or the end. In either case the total execution time would be 16 cycles. But it matters a lot when we combine both kinds of instruction. Let’s now consider the following, slightly more complicated code sequence:

NOP

CLR.L D0

BRA.W target

target NOP

BRA.W nominally takes 10 cycles. The PRM tells us it performs two bus cycles, so again there are two idle clock cycles. In this case, and contrary to CLR.L, they are at the beginning of the instruction. With this knowledge, let’s see the bus activity of this sequence:

Cycle 0: NOP Prefetch

Cycle 4: CLR.L Prefetch

Cycle 8: CLR.L Bus idle for two cycles.

Cycle 10: BRA.W Bus idle for two cycles.

Cycle 12: BRA.W Prefetch

Cycle 16: BRA.W Prefetch

Cycle 20: NOP Prefetch

Cycle 24: Next instruction

The whole sequence takes 24 cycles, exactly the nominal number. All bus cycles are aligned on a 4-cycle boundary. There are no wait states inserted by MMU. This breaks the round-up rule, which would compute a total of 28 cycles (4+8+12+4).

The rule is broken because CLR.L performs two idle clock cycles at the end, and BRA.W does so at the start. Executed one right after the other, we get four idle cycles in immediate sequence, and then the next bus cycle naturally aligns on a four-cycle boundary.

This happens whenever we combine one instruction with two idle cycles at the end, followed by one instruction with two idle cycles at the start. We say that these instructions “pair” with each other, and we call this behavior “pairing”.

Note that pairing depends not only on the specific instructions, but also on their order; it is not symmetric. Also note that pairing, or the lack of it, can happen inside a single instruction with multiple bus cycles and multiple idle cycle sequences.
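
Both traces above can be reproduced with a small bus-level model. This is a sketch under stated assumptions, not an emulator core: it only knows the three instructions used in this section, with the idle cycle positions described above, it assumes every bus cycle goes to RAM, and it assumes that cycle 0 is aligned. The table and function names are hypothetical.

```python
BUS = "bus"    # one bus cycle: four clocks, must start on a 4-cycle boundary
IDLE = "idle"  # two clocks with the bus idle

# Order of bus and idle periods inside each instruction, as described above.
STEPS = {
    "NOP": [BUS],
    "CLR.L": [BUS, IDLE],
    "BRA.W": [IDLE, BUS, BUS],
}


def run(sequence):
    """Return the total clock cycles taken by a list of instruction names."""
    cycle = 0
    for name in sequence:
        for step in STEPS[name]:
            if step == BUS:
                cycle += -cycle % 4  # wait states for a misaligned access
                cycle += 4
            else:
                cycle += 2
    return cycle


print(run(["NOP", "CLR.L", "NOP"]))           # 16
print(run(["NOP", "CLR.L", "BRA.W", "NOP"]))  # 24
print(run(["NOP", "BRA.W", "CLR.L", "NOP"]))  # 28
```

The last line swaps CLR.L and BRA.W. In this model the two instructions no longer pair, and the total goes back to the rounded-up figure of 28 cycles, which illustrates why the order matters.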

Unfortunately the behavior of pairing depends in turn on the internal timing of each instruction. Computing the exact number of cycles then requires knowing, for each instruction, when exactly the idle cycles are performed. The current version of this document doesn’t provide an instruction table detailing the idle cycles location. We expect to provide such a table in an updated version.

Fourth stage: Some bus accesses don’t need to be aligned.

Until not too long ago, the ST community considered that the previous section gave the ultimate answer for counting cycles on the ST. We recently discovered, or more precisely rediscovered, that it is not accurate.

Going back to previous sections, let’s remember that what makes the ST timing differ from nominal is the need to arbitrate between concurrent CPU (or external) and Video (or internal) access to RAM, and that MMU inserts wait states as needed to implement this arbitration and interleaving.

But what happens when a bus cycle doesn’t access RAM? What happens when ROM is accessed instead? Can a ROM access be performed concurrently with an internal RAM access? Does MMU still insert wait states for unaligned bus accesses?

The answer is in the schematics, together with a little knowledge about the ST chipset. We can see that the RAM is separated from the main bus by a few TTL chips. They are two buffers and two latches, all of them with tri-state capability. The TTL chips are not present in the STe board, but they are integrated in the newer MMU-GLUE combo IC. They are required because otherwise MMU wouldn’t be able to exactly align the RAM phases with the CPU phases. But yes, this also means that when RAM is not accessed, it is disconnected from the main bus.

Furthermore, MMU doesn’t manage access to devices on the main bus. In the specific case of ROM, access is managed by GLUE. So not only is there no need to align a bus access to ROM; MMU can’t actually perform the alignment. GLUE has no reason to perform any alignment, and doesn’t even have the information to perform it.

Then we arrive at the (almost) final rule for counting cycles on the ST:

A bus cycle accessing the internal MMU data bus is always performed aligned, in relation to the previous access on the same bus, at a four-cycle boundary.

The internal MMU bus is the one beyond the TTL chips, which connects to the RAM and Shifter. In other words, only accesses to main RAM and Shifter must be aligned. In particular, ROM access, either to internal TOS ROM or to Cartridge, is performed without any wait states.

This means that code in ROM usually executes slightly faster than code in RAM. But please note that only actual ROM accesses are faster. RAM access performed from ROM code still would be aligned. And ROM access performed from RAM code is not aligned (but this rarely saves any clock cycles).
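
A sketch of this refined rule, applied to the NOP / CLR.L / NOP sequence from the previous section. For illustration only, it assumes that all three prefetches come from the same region and that cycle 0 starts on an external slot; the names are hypothetical.

```python
# Regions that sit on the internal MMU bus and therefore need alignment.
ALIGNED_REGIONS = {"ram", "shifter"}


def bus_cycle(cycle: int, region: str) -> int:
    """Perform one bus cycle starting at `cycle`; return the cycle it ends on."""
    if region in ALIGNED_REGIONS:
        cycle += -cycle % 4  # wait states only for RAM and Shifter
    return cycle + 4


def nop_clr_nop(code_region: str) -> int:
    """Clock cycles for NOP / CLR.L D0 / NOP fetched from `code_region`."""
    cycle = 0
    cycle = bus_cycle(cycle, code_region)  # NOP prefetch
    cycle = bus_cycle(cycle, code_region)  # CLR.L prefetch
    cycle += 2                             # CLR.L idle cycles
    cycle = bus_cycle(cycle, code_region)  # NOP prefetch
    return cycle


print(nop_clr_nop("ram"))  # 16
print(nop_clr_nop("rom"))  # 14
```

Under these assumptions the sequence takes 16 cycles when fetched from RAM and 14, the nominal time, when fetched from ROM.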

Lastly, note that our discussion about RAM is about the main standard RAM. There are RAM expansions implemented as “Fast RAM”. Fast RAM access is not aligned and usually doesn’t require any wait states. But of course video data can’t be located in Fast RAM.

Fifth stage: Some I/O chips are slow.

Access to I/O chips falls in a completely separate class. Some I/O devices can be accessed at full speed and without any alignment, such as GLUE or MMU registers. Most others are too slow, and some agent in the system throttles the CPU.

This version of the document doesn’t include a table, or a complete description of the access timing to I/O chips. We’ll either update this document in a future version, or we’ll cover the subject in a separate article.

Conclusion

In summary, the process required to compute the number of clock cycles that a given code sequence takes is as follows:

- Count the number of cycles for each instruction according to the PRM.

- Identify the exact location of bus and idle cycles.

- Align all bus cycles accessing RAM or Shifter.

- Add wait states for the slower I/O chips.

Note that a specific code sequence might take a different number of cycles depending on which exact locations (RAM, ROM, or slow I/O chip) are being accessed.

Notes to emulator authors

There doesn’t seem to be a simple, fast way to accurately emulate the cycles. Ideally you should follow all bus cycles, and adjust the clock cycles on each one of them. But I know several emulator authors who are far better coders than I am, and they would probably be able to figure out an efficient implementation. Anyway, an efficient implementation is beyond the scope of this document.

For this purpose it doesn’t matter how you start the alignment, whether the internal slot is the first or the second one, or what the absolute number of cycles since power-up or reset is. The important point is that once you define an aligned bus cycle, all the others must have the same alignment.
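
A small sketch of that last point. The class, its `phase` parameter and the helper function are all hypothetical names chosen for illustration, and the model is deliberately naive (one adjustment per bus cycle, no attempt at efficiency). It measures the NOP / CLR.L / NOP sequence after a first access has established the alignment, for every possible choice of phase.

```python
class BusClock:
    """Naive clock that aligns RAM bus cycles to a fixed, arbitrary phase."""

    def __init__(self, phase: int = 0):
        self.cycle = 0
        self.phase = phase  # cycle number (mod 4) on which external slots start

    def idle(self, clocks: int) -> None:
        self.cycle += clocks

    def access(self, aligned: bool = True) -> None:
        if aligned:
            self.cycle += (self.phase - self.cycle) % 4  # wait states
        self.cycle += 4


def sequence_length(phase: int) -> int:
    """Cycles taken by NOP / CLR.L D0 / NOP once the alignment is established."""
    clock = BusClock(phase)
    clock.access()   # a first access, which defines the alignment
    start = clock.cycle
    clock.access()   # NOP prefetch
    clock.access()   # CLR.L prefetch
    clock.idle(2)    # CLR.L idle cycles
    clock.access()   # NOP prefetch, delayed by two cycles
    return clock.cycle - start


print([sequence_length(phase) for phase in range(4)])  # [16, 16, 16, 16]
```

The measured length is the same for every phase; only the delay of the very first access changes.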