Because of some interference from added manpages, you should become familiar with the switches to the man(1) command. In particular, I found that I had to use switches to access the normal manpages for common system calls like open(2):
    man -l open
    man -s 2 open
    man -M /usr/man open
Since this is not a programming course, I will not be teaching what you can find in the man pages, but I will cover some of the general principles. If you have questions about the use of a particular system call, that is what office hours are for, though you should be aware that preference is given to people who have read the manpages before asking me. You can also contact any of the TAs by email, through a mail alias Matthew is setting up. Check the web page.
For the basics of how C puts the pieces together to make an executable
program, Rich has prepared a short tutorial,
which you might want to read if this is at all confusing to you.
Coding Standards
There are a few things we will be checking for in your code, and we expect
your style of coding to not interfere with that. We don't mean to impose
a particular style, but your code will need to be clearly readable. If
we are in doubt about any point of grading, well-written code with
well-chosen comments may receive the benefit of the doubt. For the
present purpose, however, it is enough if we can verify by inspection
that every system call's return code is tested for errors.
It is usually enough if
you have a consistent style of indentation, and limit yourself to one
C statement per line.
Verifying Correct Operation
One concern you might have about any project is determining when you have
got it right. I suggest that you develop a text file containing shell
commands you should expect your shell to accept, and compare the outputs
of your shell with those of one of the standard shells. Here's how.
Suppose you create a file called test.sh, which contains your
(overly short) test suite:

    echo this is okay 1
    echo this is not okay 2 | sed s/not.//

You can run this with bash like this:
    [YourPrompt]$ bash < test.sh | tee out.bash
    this is okay 1
    this is okay 2
    [YourPrompt]$

The tee(1) command displays the output on your terminal, and also directs it to a file, in this case out.bash.
You can now use the first parameter to your jsh
executable to suppress its prompt, and you should have a similar result:

    [YourPrompt]$ jsh '' < test.sh | tee out.jsh
    this is okay 1
    this is okay 2
    [YourPrompt]$

This saves the output in out.jsh, so that
you can compare the two outputs:
    [YourPrompt]$ diff -c -s out.bash out.jsh
    No differences encountered
    [YourPrompt]$

And you are now sure that your jsh produces the same output as bash.
This is good, because I will be doing exactly the same thing when I
grade your assignments, since bash is my default shell.
We are going to make it easy on you in this way: all the inputs you will be tested against are going to be one of:
Your shell will be presented with lines that have errors in them. It will
not be presented with lines that are correct for standard shells, but which
use features you have not been asked to implement. For instance, we will not
present lines with redirection of standard error, in part because the
various shells use different syntaxes for it. Likewise, your shell will not be
presented with shell builtin commands like if or while, with operators
like || or &&, or with "here documents" introduced by <<.
You are therefore free either to execute such lines or to issue an error
message for them.
On the other hand, you may well be required to deal with lines like these:
    |||
    echo foo | sleep 1 |||
    echo foo | cat < lab1.txt

All three of these contain at least one error, and your shell should detect them.
You will also encounter problems that are not obvious from the command line.
For instance, a non-executable may be named as if it were a UNIX command. You
will therefore encounter an error in starting the command. Look at error
codes such as EACCES, ELOOP, and ENOTDIR.
What we require for errors is
The redirection phrases can appear anywhere in a command. For instance,
consider the command cat -n test.sh. It should produce output like this:

    [YourPrompt]$ cat -n test.sh
         1  echo this is okay 1
         2  echo this is not okay 2 | sed s/not.//
    [YourPrompt]$

You would also expect cat -n test.sh > outfile to put that two-line
output in the file. Did you know that you don't have to put the
redirection at the end of the command? Try cat -n > outfile test.sh,
or even > outfile cat -n test.sh. Your shell will be expected to do
the same.
Likewise, your shell should wait for all commands in a foreground pipeline before continuing, and if you like, it can report the completion of background jobs before printing its prompt (this is not a requirement).
Your shell should be able to find and execute UNIX programs anywhere in
your searchpath, and should pass in all the command-line arguments that are
in the command line. The searching part of this should be automatic because
you're going to use execvp(2), but
you will have to keep track of the arguments yourself. Note that the
arguments list can be pretty big, and you can't predict its size in
advance, except that we're using a limit of 256 characters per input line.
Files and File Descriptors
UNIX (and Linux) keep track of files in two different ways. The raw file
is known to the program through a file descriptor, which
is an integer, usually small. When a new file descriptor is created, it
is always the smallest non-negative integer which does not already denote
an open file in the process where it is created. When a program is first
started, and its main
function is entered, there
are three open file descriptors: 0, 1, and 2. If nothing else happens,
the next file descriptor to be opened will be 3. The first three file
descriptors are by convention assigned the following roles: 0 is
standard input, 1 is standard output, and 2 is standard error output.
If you should close file descriptor 1 and open another file, that file
will be assigned file descriptor 1. Likewise, if you close file descriptor
1 and use the dup(2) system call to duplicate file descriptor 2, standard
output will now go to wherever standard error output was going. This may
or may not be a change. This is how you will accomplish redirection
and arrange pipelines when you start programs. You may prefer to use the
dup2(2) system call; it's up to you.
Files, represented by file descriptors, can be buffered by wrapping the
file descriptor into a stream, represented by a FILE pointer.
There are streams opened for you on the standard file descriptors, called
stdin, stdout, and stderr. You will not have to do anything about these,
but you can use them for reading input and issuing output.
Pipes
In UNIX, a pipe is a communications channel between processes. Pipes can
be created in several ways, but for our purpose, we'll be using the results
of the pipe(2) system call. (Solaris creates bi-directional pipes, while
Linux pipes are one-directional; we won't be using bi-directionality in
any case.) The pipe(2) function returns two file descriptors, which
communicate with each other. Anything printed to one of them is available
as input from the other, subject to a limit of internal buffering, often
5120 bytes. This is how you will arrange for the inter-process
communication required for a pipeline.
Consider the pipeline from our earlier example:

    echo this is not okay 2 | sed s/not.//

For this pipeline to execute correctly, the stdout of the echo command
should be connected to the stdin of the sed command. This is accomplished
by manipulating file descriptors as the commands are spawned; we'll have
to wait until we get there before covering that.
New processes are created with the fork(2)
system call. This call is peculiar in that it is entered once, and yet
it returns twice. It will return once for your shell program, returning
the PID or "process ID" of the child process that is being spawned, and
it will return again with zero in the context of that child. In both
cases, it returns to the same place in your program: right after the call
to fork(2). Accordingly, one appropriate idiom for handling this is:

    #include <sys/types.h>
    #include <unistd.h>

    pid_t pid;

    pid = fork();
    if (pid == (pid_t)-1) {
        /* an error happened, and no child was spawned */
        {do error processing}
    } else if (pid == 0) {
        /* the child executes here */
        {prepare file descriptors}
        {execvp to switch the child to the correct program}
    } else {
        /* the parent (original) process executes here */
        {record the pid; it's the ID of the child}
        {fix file descriptors}
    }
    /* the parent continues here */

You can see that the return value from the fork
call allows you to do the right thing for each of the two processes.
C itself provides a number of functions that can operate on strings,
and while you should become familiar with them, you won't need them much
for this lab. The most commonly-used ones are all on the same man page,
which you can access with the command man -M /usr/man string.
Your shell is going to read commands from standard input. Each command will be stored as a string, probably with a newline and a null at the end. That's easy, but what are you going to do with it then? You can probably find libraries that will help you somewhat, but you may find that learning how to use the library is as hard as or harder than doing the parsing yourself. Here are some ideas that may help.
You can read the command into a buffer and break it into "words" pretty easily, by scanning over the characters with a variable of type "pointer to char". The loop is pretty simple: when you see a non-whitespace character for the first time, it's the beginning of a word, and you copy your pointer into a list (or array) of pointers to the words you've found so far. When you see whitespace again, change it to a null (binary zero), which is the way C programs usually denote the end of a string. We've promised that there will be whitespace around all the words and operators, so this will work for you. Keep going until you get to the end of the line. You can stop either at the newline or the null.
As a result, you now have an array of pointers to all the words in the command line. You can go back and parse these into commands by noting the positions of the pipe chars ('|'), which divide the pipeline into separate commands. Or you could have done some of that work while building the array. In either case, you should first identify all the redirection phrases in the pipeline ('<', '>', and '>>' words followed by a filename) and move the pointers to some other place; they can occur anywhere in the command (even first, before the program name) and this can otherwise be confusing. With these out of the way, the first word in each command is the program name, and all the others are arguments.
You may want to be sure that the buffer that you're using can
be retained a long time, in case the command is a background pipeline.
The next section discusses one way to do this.
Putting It All Together -- Data Structures
Because syntax errors have to be caught before you start any processes,
you will need a data structure to keep track of the pipeline being
constructed. I expect that for each input line, you should assume it's
a pipeline of an undetermined number of UNIX commands. A sensible idea
would be to represent this as a list, with one cell per UNIX command.
Each cell should contain information about the redirections encountered,
the arguments seen, and whether this command is the first, or last command
of the pipeline (maybe both). The arguments will probably be another list,
because you don't know in advance how many there will be; or you can
just use arrays of length 128. We have promised none of the lines
are going to be longer than 256 characters, and we can't specify more
than 128 arguments in that much space.
When you've scanned the line, and built the data structures, you can loop
through the commands spawning the child(ren) that are called for.
If the command is not backgrounded, you can begin waiting for these children to die. For background commands, however, you may want to retain at least some of this information (mainly the pids) so that you can verify that you are harvesting them all. Accordingly, you may want yet another list, this one keeping track of all the background jobs that have not yet finished.
Since you have all these lists, and their lifetimes are potentially very different from the lifetimes of any of the variable scopes in your program, you have to be concerned with where you are going to store these data structures. Even setting aside the variable sizes involved, local variables won't do, because they would probably go out of scope too soon. Global variables aren't much better, because of the variable sizes.
This is a job for malloc(3C) and free(3C). These functions
allow you to get space that has global scope, of varying sizes, and
to return it to the system when you're done with it. This last point
is very important. You have to return storage that's used for
a pipeline when it is finished, because otherwise you run the risk
of running out of memory before you're done. To test for this, you
can expect that we will run your shell on an input file with a few
million commands in it. If you're not careful about resource leakage,
your shell will not be able to complete the entire script.
At Last, Starting the Pipeline
When your shell has verified the command line, and collected all the
information about the commands in the pipeline, it has to start the
programs and arrange for the redirection and pipelines.
This is surely the hardest part of the lab to get right.
Most of the magic is in the fork() call, which creates the new UNIX process, called the child, which will do the work of one of the commands in the pipeline. fork() creates a new process which is a duplicate of the calling process in most ways; they are both running the same program (Jshell) at the same place (about to return from fork()), with the same contents of all local and global variables, the same file descriptors pointing to the same files, and almost everything else. The only differences that matter for this exercise between the calling (parent) process and the child process are their process IDs, their parent process IDs, and the return value that will be passed back from fork() when the processes are allowed to continue. The child will see a return value of zero. The parent will see a nonzero return value, which is the process ID of the child. Since both processes return from the fork() call at exactly the same place, the code at that place needs to test the return value to tell whether it should take the actions appropriate for the parent or for the child. In this way, the two processes correctly discern their identities as parent or child, and take the appropriate paths through the code.
Here's a sketch of how it might go (there are other ways). For clarity I have omitted error handling, but you will have to include it.
    LOOP through the commands:
        IF this is NOT the last command in the pipeline:
            CALL pipe to allocate two file descriptors for the pipe
            save one as PIPEOUT, one as NEXTIN
        END IF
        CALL fork
        IF this is the child:
            IF this is the first command in the pipeline:
                IF there is input redirection:
                    close file descriptor 0 (stdin)
                    open the redirect file (replaces stdin)
                END IF
            ELSE:
                close file descriptor 0 (stdin)
                dup file descriptor PIPEIN (it replaces stdin)
                close file descriptor PIPEIN (only need one copy)
            END IF
            IF this is the last command in the pipeline:
                IF there is output or append redirection:
                    close file descriptor 1 (stdout)
                    open the redirect file
                END IF
            ELSE:
                close NEXTIN (the next child needs it, not this one)
                close file descriptor 1 (stdout)
                dup PIPEOUT (replaces stdout)
                close PIPEOUT (only need one copy)
            END IF
            CALL execvp (only returns on error)
        ELSE (this is the parent process (Jshell)):
            record the child process ID
            IF this is NOT the first process in the pipeline:
                close PIPEIN (the child has it open, we don't need it)
            END IF
            IF this is NOT the last process in the pipeline:
                close PIPEOUT (the child has it open, we don't need it)
                move NEXTIN to PIPEIN
            END IF
        END IF
    END LOOP

That may seem pretty ugly, but your shell has been doing that (and more) for you for years. Aren't you grateful? Anyway, let's consider what is going on here.
During execution of this loop, the Jshell process maintains its usual file descriptors 0, 1, and 2, so that they are always inherited by the children. When there is more than one child, there will be pipes open as well, and these are also inherited by the children. In addition, for the first and last child (if there's only one, it is both first and last) there may be redirection; in this case the filenames that Jshell has recorded are available to the child as well. It is up to the child to rearrange its file descriptors as needed between the fork() call and the execvp() call.
The Jshell process does not open the redirected files, although the code is in the Jshell program. This code is executed by the child process before it uses execvp() to switch to the program that is called for by the command line being executed. The Jshell process also does not manipulate its file descriptors to arrange the pipeline, other than to call pipe() to create the file descriptors the children will use for that purpose. As each child is called, there may be none, one, or two such file descriptors open.
When Jshell waits for completion of the foreground commands, it automatically reaps the processes involved, and no zombies are left behind. That part is easy. For background jobs, there are two main ways to do it. One uses signals, and the other does not. Signals are a bit tricky, and we'll be covering them in other labs anyway, so I'm not going to discuss them here. You can use them if you want to, but you have to deal with their trickiness on your own.
The other way is to use the wait3() or wait4() system calls to find out about children that have finished. You definitely want to use the WNOHANG option, so that you don't block waiting for background jobs; you should proceed even if there are none that have finished. Since you'll probably report the finished jobs as part of issuing the command prompt, you may as well do the testing there too. You will in this way make sure that you reap the zombies on each input command. You will have to somehow keep track of the pids which have completed for each background job, so that you know when they have all been reaped and it's time to report the background job as completed.
You also need to be careful about waiting for the completion of children involved in your foreground pipelines. You want to wait for all of them before you go on to the next line of input, but you don't want to do it too soon. Why not?
It has to do with the pipes involved in the pipeline. The commands in a pipeline are connected by a pipe, and the pipe has a limited capacity. If you start the first command and immediately wait for it to finish, it will do what work it can, but it may or may not be able to finish. In particular, if it generates more than 5120 bytes of output, and sends them into the pipe, only the first 5120 will fit in the pipe, and the child will block, waiting for something to empty the pipe. But this will never happen, because your Jshell has not yet started the next command in the pipeline, and no other process is going to do it. So the first command is sending into the pipe and no one is receiving. The receiving process won't be started until Jshell stops waiting, Jshell won't stop until the current process completes, and it won't complete until the pipe empties... This situation is called a deadlock: two or more processes are each waiting for the others to finish, and as a result none of them will.
Your solution is to defer waiting for even your foreground processes
until the entire pipeline has been started, and all the commands
are running. In this way, there will be some process listening
to the output side of every pipeline, and work should be able to
proceed.
Debugging Notes
There are lots of techniques for debugging, and you probably already
know a bunch of them. You may already know about putting in
print statements, and about using debuggers like gdb. But sometimes
things still go wrong anyway, yet never seem to go wrong under the
debugger, and you haven't been able to figure out your problem from
the print statements.
Sometimes that's because you just don't have enough print statements, possibly because there's too much output for you to be able to sort through it while the program runs. That's a job for script(1). This is a command that captures the contents of an interactive session for later examination. Check out the man page.
Sometimes you need to know what system calls are being used, and in particular, if your program hangs, you want to know what system call is being executed. The strace(1) tool may help you with this. It will log all of your system calls, and can even log the system calls of your children. Check out this man page too. Just be aware that if you hang in a system call, the very last one may or may not be logged; I honestly don't remember. But strace has been very good to me over time; you probably want to make friends with it.