/*
** implementation.txt
**
** Copyright (c) 1997, W. Sheldon Simms III
*/

This file discusses the implementation of the em65 65c02 emulator
at both an abstract process communication level and at the UNIX system
call level. Other files document different aspects of the em65 emulator
as follows:

"65c02.txt"          : how to write 65c02 programs that em65 can execute
"debugger.txt"       : how to use em65's built-in debugger
"environment.txt"    : how 65c02 programs can use emulated devices

Please send suggestions and corrections to sheldon@atlcom.net

This file is current with em65 version 0.6

--------------------------------------------------------------------------

*** Introduction ***

My goal in creating the em65 program has been to build an emulator that
provides a complete emulation of a simplified machine architecture. Em65
is only the latest in a series of emulators that I have written. The
first several were a series of Apple IIe emulators that I successively
refined in an attempt to make them run at full speed on the computer I
had available - my Macintosh Centris 650. These emulators were abandoned
after I found that full speed was not quite possible.(1)

Next I wrote an emulator that emulated an instruction set of my own
design. The instruction set was designed primarily for easy decoding in
software. I still have this emulator in a state of about 75% completion.
It runs programs but lacks usable emulated devices, except for a
virtual terminal.

The reason this emulator was never finished is closely related to the
reason I have written em65. I was writing this emulator on my Mac and ran
into a problem - I couldn't get concurrency of I/O. When the emulated
disk was accessed, the emulated CPU stopped until the disk I/O was
complete. This is because the emulated CPU wrote to or read from some
emulated I/O address, which in turn called a function to read or write
from a file that served as the emulated disk. Since file I/O blocks
programs on the Mac, the emulated CPU was effectively stopped during the
entire I/O process.

From one point of view this is nice: as far as programs running on the
emulated computer are concerned, I/O happens instantly between
instructions. However, it's not very realistic, and one of my goals in
writing the emulator was to provide an environment very similar to what
might be provided in a physical machine.

If anyone reading this is a Mac programmer, you might be objecting "You
should have used the thread manager", or "Newer SCSI managers allow
non-blocking disk I/O". I considered both. The thread manager is out of
the question because you can't call OS functions from a preemptively
scheduled thread. As for the newer SCSI manager, it seemed like more
trouble than it was worth and the resulting program wouldn't run on a lot
of Macs anyway, because they don't use the newer SCSI manager.

I never conclusively decided to drop that project; I just worked on it
less and less until I realized I hadn't touched it in more than six
months. The idea stuck with me, though, after I installed NetBSD on
the same old Mac. When casting about for a program to write that would
be a good UNIX exercise, I remembered the last emulator. Since I also
knew about Andre Fachat's OS/A65 multitasking operating system for 6502
based computers, and I am quite familiar with the 65xx family of MPUs,
I decided to write an emulator that would have concurrent I/O and that
would run OS/A65. As of this writing, I have spent an average of about
30 hours per week for two weeks on the program and I am close to an
initial release, version 0.6.

In the discussion to follow, it helps to remember that I have two main
specifications for the emulator:

1) It will emulate a simplified, but still realistic, hardware
   architecture and it will act like real hardware, meaning that the
   various subsystems will operate concurrently.

2) It will run the OS/A65 multitasking operating system and will be
   usable in real time as a real computer running a real operating
   system.

The second specification is important and closely related to the first.
If the second specification did not exist, I could have written the
emulator on a single-tasking system. Concurrency can always be emulated
by having the virtual disk read or write a single byte between every
instruction executed on the emulated CPU. Complete emulation of
instruction cycle times and programmable timers can be done if you don't
mind the emulated CPU executing only a couple hundred instructions per
second. While this might be an interesting academic exercise, it wouldn't
be usable in real time.

NOTES:

(1) My last effort would run full speed under certain circumstances, but
    when there was a lot of screen I/O, as is the usual case for Apple II
    programs, the speed would drop to around 80% to 90% that of a real
    Apple IIe. No doubt the same program would run at several times the
    speed of a real Apple IIe on a current computer, but I'm no longer
    that interested in Apple IIe emulation. For one thing, other people
    have written emulators, so I don't have to. None of these other
    Apple IIe emulators run at full speed on my Mac either, so I don't
    feel too bad about my effort.

--------------------------------------------------------------------------

*** Implementation Overview ***

Because of my experience with the unfinished emulator under MacOS, the
very first issue I thought about when considering the design of em65 was
how to ensure the concurrency of the various subsystems. Since the
program was to be written under UNIX the solution was straightforward:
subsystems that run concurrently are different processes. The way I
think about it is that the main process - the startup process - is the
CPU emulator, and that emulated devices are child processes of the CPU
process.

As of version 0.6, em65 has three emulated devices: the virtual terminal,
the virtual disk, and the virtual clock interrupt source, which I will
simply refer to as "terminal", "disk", and "clock" from now on. The
clock is not spun out into a separate process, since the UNIX kernel
is itself the concurrently running clock. The clock implementation will
be discussed in more detail later.

The basic architecture of em65 0.6, then, is a parent CPU process along
with two child processes, one each for the terminal and the disk as in
the following diagram:

                 inter-process               inter-process
                 communication               communication
 +----------+  <----------------  +-----+  <---------------  +------+
 | terminal |                     | CPU |                    | disk |
 +----------+  ---------------->  +-----+  --------------->  +------+


As the diagram shows, inter-process communication (IPC) is one of the
main requirements for implementation of this model. Em65 uses several
types of IPC as will be discussed later.

This basic architecture calls for the CPU process to execute 65c02 code
and to provide an interface to the terminal and disk that hides the
IPC that must occur, so that device access looks like regular 65c02
operations. This is done by trapping 65c02 accesses to a certain area
of its address space, the emulated I/O space, and turning them into
function calls that implement the IPC.

Generally, the IPC is handled as a message and response. For example,
a write to $EC00 (that's a hexadecimal address in the 65c02's 16 bit
address space) calls a function that sends a message to the terminal
process telling it that it should print the character that was
written to $EC00. When the terminal process gets the message, it
responds, signalling that it received the message, and then prints the
character on the physical screen - at least in theory.

One complicating factor in the design came from another early decision
to include a 65c02 machine level debugger in em65. The main reason for
doing so was to make it easier to debug the emulation. Of course, it
also helps immensely in writing 65c02 programs to run on the emulator
and will be a great help in porting the OS/A65 operating system to the
emulator. The debugger complicates the design because it is naturally
a part of the CPU process, but it needs to share the physical screen
with the terminal process.

Once again this is handled with messages sent between the processes. The
CPU process, and thus the debugger, is considered to own the physical
screen. The debugger displays its interface to the user and allows the
user to enter commands which single step the emulated 65c02, set
breakpoints, edit memory, or whatever. If the user selects a debugger
command that hides the debugger screen in favor of the terminal, the
debugger sends a message to the terminal process telling it that it may
now use the physical screen. The terminal process updates the screen to
reflect anything it "printed" earlier that never actually appeared and
then acts like a regular terminal, printing onto the physical screen when
it gets a character.

Even when the terminal process is allowed to write on the screen,
however, the CPU process still does keyboard input for the terminal,
sending it messages when keys are hit, and the debugger still owns the
physical screen. If some event occurs that requires the debugger to
display, such as a breakpoint being hit, the debugger sends the terminal
process a message telling it to give up the physical screen immediately.
When the terminal responds, the debugger redraws its own screen and
the terminal process is once again limited to merely keeping track of
what it thinks the screen should look like, waiting for the time when
it can write on the physical screen again.
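
The hand-off described above can be sketched as a shadow buffer that is
always updated and a physical screen that is only touched while the
terminal has permission. This is an illustration only, not code from
em65: the function names, the 24x80 size, and the 'physical' array
(a stand-in for the real curses screen) are all assumptions.

```c
#include <string.h>

#define SCREEN_CELLS ( 24 * 80 )      /* assumed 24x80 terminal */

static char shadow[ SCREEN_CELLS ];   /* what the terminal thinks is shown */
static char physical[ SCREEN_CELLS ]; /* stand-in for the real screen */
static int  cursor;
static int  owns_screen;              /* nonzero while the debugger has yielded */

/* Handle a "print" message: always record, draw only when allowed. */
void terminal_print( char c )
{
    shadow[ cursor ] = c;
    if( owns_screen )
        physical[ cursor ] = c;
    cursor = ( cursor + 1 ) % SCREEN_CELLS;
}

/* Handle "you may use the screen": repaint from the shadow copy. */
void terminal_grant( void )
{
    memcpy( physical, shadow, sizeof physical );
    owns_screen = 1;
}

/* Handle "give up the screen": keep tracking, stop drawing. */
void terminal_revoke( void )
{
    owns_screen = 0;
}
```

Characters printed while the debugger is visible land only in the
shadow copy and appear on the screen at the next grant.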

The disk process is a bit simpler, since it does not have to contend
with any other process for exclusive use of any resource. When the disk
process is started, it opens a file that is used as the virtual disk.
This file is accessed by the disk as a sequence of 512 byte blocks,
which represent "logical blocks". The disk emulation does not go so far
as to emulate the physical geometry of a hard disk, such as the number
of heads, cylinders, sectors per cylinder, etc. Instead, it emulates
an intelligent controller that provides an abstract view of the disk as
a sequence of logical blocks. This is done mainly to ease the programming
on the 65c02 side. Just for fun, the emulated disk controller does
include a command that causes the disk to return a bogus hardware
geometry. The file that serves as the emulated disk grows up to the
number of blocks on the emulated disk times 512 bytes. Currently, the
disk accepts 16 bit block numbers, so the emulated disk is at most
65536 blocks, or 32 MB, in size.
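
The arithmetic behind these figures can be sketched as follows. The
names BLOCK_SIZE, BLOCK_COUNT, block_offset() and disk_capacity() are
hypothetical and are not taken from the em65 source; only the 512 byte
block size and the 16 bit block number come from the text.

```c
#include <stdint.h>

#define BLOCK_SIZE  512u      /* bytes per logical block */
#define BLOCK_COUNT 65536u    /* 16 bit block numbers: 2^16 blocks */

/* Byte offset of a logical block within the disk image file. */
static uint32_t block_offset( uint16_t block )
{
    return (uint32_t)block * BLOCK_SIZE;
}

/* Largest size the disk image file can grow to, in bytes. */
static uint32_t disk_capacity( void )
{
    return BLOCK_COUNT * BLOCK_SIZE;   /* 65536 * 512 = 33554432 */
}
```

A disk process could pass block_offset() to lseek() before reading or
writing one block of the image file.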

For further ease of programming on the 65c02 side, I desired the disk
controller to be an emulated DMA controller. I wanted the CPU process to
be able to write commands to the disk telling it what operation to
perform and what address in 65c02 address space to use for its operation,
and have the disk read or write directly to the 65c02's memory. This
requires more IPC. For my first attempt at implementing the disk, I
completely forgot that the disk was a separate process and had it simply
write to the memory array, just as the emulated 65c02 instructions do.
Of course, the disk process had its own copy of the data space so,
although it wrote to memory just fine, the emulated 65c02 never saw the
data. The details of the current implementation are discussed later.
For now, it suffices to say that IPC is used to read data from, and write
data to, the 65c02's memory.

Since one of the main goals of em65 is to run the OS/A65 operating system
and be usable in real time, it was necessary to provide em65 with a
clock interrupt that counts real "wall clock" time, not an emulated
time based on the instructions executed and their nominal cycle counts,
for example. This is more concurrency, but no separate process was needed
since the UNIX kernel provides services that allow real time to be
counted. The kernel itself acts as the separate "process" for the clock.
The clock provides interrupts to the emulated 65c02 at a rate chosen at
compile-time (currently 30 Hz). The 65c02 can turn the clock on and off
by writing to $EC20 in its emulated I/O space.

This brings up the final topic of the overview, which is the implemen-
tation of 65c02 interrupts in the emulator. Since these emulated
interrupts come from all three of the processes they are also
implemented as IPC messages. In the case of clock interrupts the CPU
process sends a message to itself. The message sent specifies which
interrupt is being asserted. The CPU process checks for interrupt
messages in between execution of each instruction. Although the primary
reason it does so is ease of programming, it happens to work well with
the 65c02 which, even as a physical chip, puts off interrupts until the
currently executing instruction is complete.

--------------------------------------------------------------------------

*** Implementation in UNIX ***

This section of this file discusses how the abstract architecture
described above is implemented in UNIX. It is almost the case that the
entire answer can be given in one word: "pipes". However, pipes are
used in more than one way and there are some other things going on as
well.

When em65 is started, it uses the pipe() system call to create several
pipes. First, it creates the interrupt pipe. This pipe can be written
into by anybody using the function assert_interrupt(), and is read by
the function check_interrupts(), which is called after every 65c02
instruction is executed. The interrupt pipe is unique among the pipes
created for IPC in that the read descriptor of the pipe is set to
non-blocking. Although the 65816 MPU has a "wait for interrupt"
instruction for which a blocking read on the interrupt pipe might be
handy, I don't want the emulated 65c02 to wait for an interrupt after
every instruction, so the read end of the interrupt pipe is made non-
blocking by using the fcntl() system call. This way, check_interrupts()
will quickly return if no interrupt message has been sent and the CPU
can execute another instruction.
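
A minimal sketch of such an interrupt pipe is shown below. The function
names assert_interrupt() and check_interrupts() come from the text, but
their signatures, the one-byte message format, and the helper
init_interrupt_pipe() are assumptions made for this illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static int interrupt_pipe[ 2 ];   /* [0] = read end, [1] = write end */

/* Create the pipe and make only its read end non-blocking. */
int init_interrupt_pipe( void )
{
    int flags;

    if( pipe( interrupt_pipe ) < 0 )
        return -1;
    flags = fcntl( interrupt_pipe[ 0 ], F_GETFL );
    if( flags < 0 )
        return -1;
    if( fcntl( interrupt_pipe[ 0 ], F_SETFL, flags | O_NONBLOCK ) < 0 )
        return -1;
    return 0;
}

/* Any process holding the write descriptor may post an interrupt. */
int assert_interrupt( unsigned char which )
{
    return write( interrupt_pipe[ 1 ], &which, 1 ) == 1 ? 0 : -1;
}

/* Returns the pending interrupt number, or -1 if none is waiting. */
int check_interrupts( void )
{
    unsigned char which;
    ssize_t n = read( interrupt_pipe[ 0 ], &which, 1 );

    if( n == 1 )
        return which;
    return -1;   /* an empty non-blocking pipe makes read() fail at once */
}
```

Because only the read end is non-blocking, a check on an empty pipe
costs one quick system call and the CPU loop carries on.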

After creating the interrupt pipe, em65 creates two pipes for each
virtual device: a to_device pipe and a from_device pipe. Reads on
both of these pipes are blocking. This is important, because it makes
message passing occur more quickly and significantly reduces the load
that the device processes place on the host CPU. In fact, with blocking
reads on the to_device pipes, the device processes spend most of their
time idle, blocking for messages on the pipe. Since the clock is not
implemented as a separate process, no pipes are created for it. However,
its interface is exactly like those of the disk and terminal.

The updated diagram of the em65 architecture, then, looks like this:


                 to_terminal                 to_disk
+----------+ <----------------- +-----+ ---------------> +------+
| terminal |    from_terminal   | CPU |     from_disk    | disk |
+----------+ -----------------> +-----+ <--------------- +------+
     :                            :  ^                       :
     :                            :  |                       :
     :                            :  | interrupt             :
     :                            :  |                       :
     :                            -->|                       :
     :                               |<-----------------------
     ------------------------------->|


As useful as pipes are for the kind of IPC em65 is usually doing, they
are not very good for implementing the emulated DMA disk controller.
DMA is implemented by using shared memory. After creating all of the
pipes that will be needed, em65 allocates the entire 65c02 memory (64 KB)
as a block of shared memory, using the system call shmget().

At this point, em65 is still only one process. All of the pipes have to
be created and the shared memory allocated before device processes are
created so that all of the processes will have valid copies of the access
variables (file descriptors for pipes, shared memory ID for the memory).

After the shared memory has been allocated, the terminal and disk
processes are created using the system call fork(). After the fork, the
terminal process immediately calls terminal_main() which reads messages
from to_terminal and handles them. The disk process similarly calls
disk_main(), although it first uses the system call shmat() to map the
shared memory block into its own address space and points its copy of
the pointer variable 'memory' at its beginning.
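
The effect of sharing the block can be demonstrated with a small
stand-alone sketch. It is not the em65 code: the function dma_demo(),
its error handling, and the use of IPC_PRIVATE are assumptions. A child
process plays the part of the disk and stores one byte, and the parent
then reads that byte back through its own mapping.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MEMORY_SIZE 0x10000   /* the full 64 KB 65c02 address space */

/* Returns the byte the parent sees at 'address' after a child process
   wrote 'value' there, or -1 on error. */
int dma_demo( unsigned short address, unsigned char value )
{
    unsigned char *memory;
    int shmid, result = -1;
    pid_t pid;

    /* allocate before fork() so both processes know the ID */
    shmid = shmget( IPC_PRIVATE, MEMORY_SIZE, IPC_CREAT | 0600 );
    if( shmid < 0 )
        return -1;

    pid = fork();
    if( pid == 0 )
    {
        /* "disk" process: map the block and write into it */
        memory = shmat( shmid, NULL, 0 );
        if( memory == (void *)-1 )
            _exit( 1 );
        memory[ address ] = value;
        _exit( 0 );
    }

    if( pid > 0 )
    {
        /* "CPU" process: map the same block and look at the result */
        memory = shmat( shmid, NULL, 0 );
        waitpid( pid, NULL, 0 );
        if( memory != (void *)-1 )
        {
            result = memory[ address ];
            shmdt( memory );
        }
    }

    shmctl( shmid, IPC_RMID, NULL );   /* release the block */
    return result;
}
```

Had the child written to an ordinary array instead, the store would
have gone to its private copy and the parent would still read zero.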

After forking the device processes, the CPU process initializes the
virtual clock device by setting up a per-process real time timer using
the system call setitimer(). Actually, an option is provided to use real
time or process virtual time for this timer. Real time is just that.
Process virtual time is "wall clock" time, but is only counted when the
cpu process is executing in user mode. Real time is the default and is
what is needed to run OS/A65 interactively. Process virtual time would
be used, for example, if you were profiling 65c02 code using the virtual
clock. Real time would be inappropriate in that case since time would
still be measured even when the CPU process wasn't running. Consequently
the profile results would be meaningless as the recorded times would
have little relation to what parts of the 65c02 code take the longest to
execute.

The setitimer() system call sets a kernel maintained per-process timer
which sends a signal to the process when the timer expires: SIGALRM for
a real time timer, or SIGVTALRM for a process virtual time timer.
Therefore the CPU process has to catch this signal. The clock interrupt
is then implemented simply by having the catch function,
catch_sigalrm(), call assert_interrupt() and thus place an interrupt
message in the interrupt pipe, which will be read after the execution
of the current 65c02 instruction is complete.

--------------------------------------------------------------------------

*** The details on pipe IPC ***

Now I can explain how the pipes are used to implement memory mapped
I/O. In the 65c02 emulator, there are two macros READ_MEMORY and
WRITE_MEMORY. These macros take a 16-bit address as an argument and
usually just read or write that byte of the shared memory block, which
is treated like an array of unsigned char with the declaration:

unsigned char *memory;

However, the reason these macros exist is that they contain a conditional
expression so that if the address specified falls in the range from
$EC00 to $ECFF, they don't read or write the memory array. Instead they
use the low byte of the address to index arrays of function pointers
declared as follows:

typedef void (*wfunc)( unsigned short, unsigned char );
typedef unsigned char (*rfunc)( unsigned short );

wfunc iow_func[ 256 ];
rfunc ior_func[ 256 ];

Initially these arrays are filled with pointers to two empty functions
something like this:

void w_default( unsigned short address, unsigned char byte )
{
}

unsigned char r_default( unsigned short address )
{
    return 0;
}

In the initialization of the virtual devices, em65 inserts pointers to
device pipe access functions in these arrays, so that when the 65c02
reads or writes certain addresses in the range $EC00 to $ECFF, it ends
up calling a function that writes to a to_device pipe.
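
One way such macros could be written is sketched below. The macro
bodies, the helper names IO_PAGE and IS_IO, the two-argument form of
WRITE_MEMORY, and the example device functions are assumptions for
illustration; the actual em65 macros may differ.

```c
typedef void (*wfunc)( unsigned short, unsigned char );
typedef unsigned char (*rfunc)( unsigned short );

static unsigned char ram[ 0x10000 ];   /* stand-in for the shared block */
unsigned char *memory = ram;
wfunc iow_func[ 256 ];
rfunc ior_func[ 256 ];

#define IO_PAGE 0xEC00                 /* base of the emulated I/O space */
#define IS_IO( addr ) ( ( (addr) & 0xFF00 ) == IO_PAGE )

/* Note: both macros evaluate 'addr' more than once. */
#define READ_MEMORY( addr ) \
    ( IS_IO( addr ) ? ior_func[ (addr) & 0xFF ]( addr ) : memory[ addr ] )

#define WRITE_MEMORY( addr, byte ) \
    ( IS_IO( addr ) ? iow_func[ (addr) & 0xFF ]( (addr), (byte) ) \
                    : (void)( memory[ addr ] = (byte) ) )

static void w_default( unsigned short address, unsigned char byte )
{
    (void)address;
    (void)byte;
}

static unsigned char r_default( unsigned short address )
{
    (void)address;
    return 0;
}

/* Hypothetical device at $EC00: remembers the last byte written. */
static unsigned char last_written;

static void w_record( unsigned short address, unsigned char byte )
{
    (void)address;
    last_written = byte;
}

static unsigned char r_fixed( unsigned short address )
{
    (void)address;
    return 0x42;
}

/* Fill both tables with the defaults, then install the device. */
void init_io( void )
{
    int i;

    for( i = 0; i < 256; i++ )
    {
        iow_func[ i ] = w_default;
        ior_func[ i ] = r_default;
    }
    iow_func[ 0x00 ] = w_record;
    ior_func[ 0x00 ] = r_fixed;
}
```

With this in place, WRITE_MEMORY( 0x1234, 0x55 ) stores into the array,
while WRITE_MEMORY( 0xEC00, 'A' ) calls w_record() and leaves the array
untouched.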

For example, writing to $EC00 sends a byte to the terminal. The terminal
process is sitting in a loop looking (abstractly) like

while( 1 )
{
    int message;

    /* blocks until the CPU process sends a message */
    read( to_terminal, &message, sizeof message );
    handle_message( message );
}

Since the pipe reads are blocking, the terminal process spends most of
its time idle, blocking in the read() system call, waiting for a message.
When the 65c02 writes to $EC00, it ends up calling a function through
the iow_func[] array with the byte written as an argument. The function
looks something like this:

void write_terminal( int message )
{
    int ack;

    write( to_terminal, &message, sizeof message );  /* send the message */
    read( from_terminal, &ack, sizeof ack );         /* wait for the reply */
}

And is called through the iow_func[] array like this:

#define MESSAGE_PRINT 0x0100
#define MESSAGE_QUERY 0x0200
...

iow_func[ address & 0xff ]( MESSAGE_PRINT | byte );

When write() is called on the to_terminal pipe, as above, the terminal
process receives the message almost immediately, since it was blocking -
waiting for a message. The terminal process then calls handle_message()
which looks something like this:

void handle_message( int message )
{
    int ack = 1;

    switch( message & 0xff00 )
    {
    case ...

    case MESSAGE_PRINT:
        write( from_terminal, &ack, sizeof ack ); /* acknowledge message */
        print_on_screen( message & 0xff );
        break;

    ...
    }
}

and the character is printed, just as it would be on a real physical
terminal. Every communication through emulated memory mapped I/O is
handled just this way.
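
The whole round trip can be tried out with the self-contained sketch
below, which is not the em65 code. The function round_trip_demo() is
hypothetical, the pipes are local arrays rather than em65's globals,
and the "terminal" child simply writes each character to standard
output after acknowledging it.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MESSAGE_PRINT 0x0100

/* Sends each character of 'text' to a child "terminal" process and
   returns the number of acknowledgements received, or -1 on error. */
int round_trip_demo( const char *text )
{
    int to_terminal[ 2 ], from_terminal[ 2 ];
    int message, ack, acks = 0;
    pid_t pid;

    /* both pipes must exist before fork() so the child inherits them */
    if( pipe( to_terminal ) < 0 || pipe( from_terminal ) < 0 )
        return -1;
    pid = fork();
    if( pid < 0 )
        return -1;

    if( pid == 0 )
    {
        /* device process: block for a message, acknowledge, act */
        close( to_terminal[ 1 ] );
        close( from_terminal[ 0 ] );
        while( read( to_terminal[ 0 ], &message, sizeof message )
               == (ssize_t)sizeof message )
        {
            ack = 1;
            if( write( from_terminal[ 1 ], &ack, sizeof ack ) < 0 )
                break;
            if( ( message & 0xff00 ) == MESSAGE_PRINT )
            {
                char c = (char)( message & 0xff );
                if( write( STDOUT_FILENO, &c, 1 ) < 0 )
                    break;
            }
        }
        _exit( 0 );
    }

    /* CPU process: send a message, then block for the reply */
    close( to_terminal[ 0 ] );
    close( from_terminal[ 1 ] );
    for( ; *text != '\0'; text++ )
    {
        message = MESSAGE_PRINT | (unsigned char)*text;
        if( write( to_terminal[ 1 ], &message, sizeof message )
            != (ssize_t)sizeof message )
            break;
        if( read( from_terminal[ 0 ], &ack, sizeof ack )
            != (ssize_t)sizeof ack )
            break;
        acks++;
    }
    close( to_terminal[ 1 ] );    /* end of file stops the child's loop */
    close( from_terminal[ 0 ] );
    waitpid( pid, NULL, 0 );
    return acks;
}
```

Calling round_trip_demo( "hi" ) prints the two characters from the
child process and returns 2, one acknowledgement per message.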

--------------------------------------------------------------------------

*** Quick guide to the source code ***

The purpose of this section is to give a very brief overview of how the
source code is organized. The source files are organized as follows:

cpu.c            - 65c02 emulation
em65.c           - initialization, main function, execution loop
cmdline.c        - command line argument and environment processing
tables.c         - tables used by cpu.c and debugger.c
debugger.c       - the 65c02 machine level debugger
disk.c           - virtual disk
terminal.c       - virtual terminal
timer.c          - virtual clock

cpu.c contains a function for each 65c02 instruction, functions to
implement interrupts, and the function execute_one() which implements
one iteration of the fetch-decode-execute cycle.

em65.c is mostly initialization code. Here all the pipes are created
and the device processes are forked. There is also cleanup code
that deallocates the shared memory and kills the device processes.
A loop calling execute_one() can be considered the core of the emulation.

cmdline.c processes command line arguments and environment variables
in order to allow the user to select various startup options.
**** INCOMPLETE, DON'T TRY TO COMPILE  ****

tables.c contains a table of function pointers that holds pointers to
all of the 65c02 instruction functions. Instruction decode and
execution takes place in one step by indexing this table with the
opcode. It also contains tables used by the debugger to disassemble
the contents of the emulated 65c02's memory.
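
Table-driven decoding of this kind can be sketched in a few lines. The
name execute_one() appears in cpu.c, but everything else here is an
assumption for illustration: the table name, the register variables,
the local memory array, and the two instructions shown (which also
skip the status flag updates a real emulation needs).

```c
typedef void (*opfunc)( void );

static unsigned char  mem[ 0x10000 ];  /* stand-in for 65c02 memory */
static unsigned char  a_reg;           /* accumulator */
static unsigned short pc;              /* program counter */
static opfunc         optable[ 256 ];

/* Placeholder for every opcode this sketch does not implement. */
static void op_unimplemented( void )
{
}

/* $A9: LDA #imm - load the accumulator with the next byte. */
static void op_lda_imm( void )
{
    a_reg = mem[ pc++ ];
}

/* $EA: NOP */
static void op_nop( void )
{
}

void init_optable( void )
{
    int i;

    for( i = 0; i < 256; i++ )
        optable[ i ] = op_unimplemented;
    optable[ 0xA9 ] = op_lda_imm;
    optable[ 0xEA ] = op_nop;
}

/* Fetch the opcode, then decode and execute it with one table lookup. */
void execute_one( void )
{
    optable[ mem[ pc++ ] ]();
}
```

Decoding costs one array index and one indirect call per instruction,
with no chain of comparisons on the opcode.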

debugger.c is the single largest source file. It implements the 65c02
machine level debugger and uses the curses terminal package for screen
display. debugger.c is the best organized source file: although this is
a 'research' rather than 'production' project, the file got big enough
that I had to organize and comment everything in a more 'production'
manner.

disk.c and terminal.c implement the two device processes. Each is
split into cpu process and device process sections. The cpu process
sections of each contain functions to write into the to_device pipe
for the device and read from the from_device pipe for the device.
The device process sections of each contain the device process
counterparts to those functions, plus the message handling loop.

timer.c implements the clock device. It looks very much like the cpu
process section of disk.c and terminal.c.

--------------------------------------------------------------------------




