Create troubleshooting.md #2324

Open
wants to merge 6 commits into
base: master
Changes from 5 commits
56 changes: 52 additions & 4 deletions build/deploy-ftp.sh
@@ -18,21 +18,68 @@ echo "login $USER" >> "$NETRC"
echo "password $FTP_SECRET" >> "$NETRC"
chmod 600 "$NETRC"

# create_archive will create a tar ball from the out folder
# try to extract it to see if it is valid
# retry 3 times then fail
create_archive(){
MAX_RETRIES=3
RETRY_COUNT=0

while (( RETRY_COUNT < MAX_RETRIES )); do
(cd out; tar czf $EMULATOR.tgz $EMULATOR)

# Verify the tarball
# The archive is written inside out/, so verify and remove it there
if tar tf out/$EMULATOR.tgz >/dev/null 2>&1; then
echo "Tarball is valid and can be expanded."
break
else
echo "Tarball is not valid or cannot be expanded. Retrying..."
RETRY_COUNT=$((RETRY_COUNT + 1))
rm -f out/$EMULATOR.tgz
continue
fi
done

if (( RETRY_COUNT == MAX_RETRIES )); then
echo "Failed to create a valid tarball after $MAX_RETRIES attempts."
return 1
fi
}

# upload_file tries to upload the tar ball to the FTP server, will retry 5 times and then fail
upload_file(){
(cd out; tar czf $EMULATOR.tgz $EMULATOR)
MAX_RETRIES=5
RETRY_COUNT=0

echo "Deploying as $USER at $HOST"
while (( RETRY_COUNT < MAX_RETRIES )); do
echo "Deploying as $USER at $HOST"

ftp "$HOST" <<EOF
# Attempt to upload the file
if ftp "$HOST" <<EOF
passive on
type image
cd $DIR
lcd out
put $EMULATOR.tgz
bye
EOF
then
echo "Upload successful."
break
else
echo "Upload failed. Retrying..."
RETRY_COUNT=$((RETRY_COUNT + 1))
fi
done

if (( RETRY_COUNT == MAX_RETRIES )); then
echo "Failed to upload the file after $MAX_RETRIES attempts."
return 1
fi
}

# test_archive_integrity will download the tarball after successful upload and verify its integrity
# by binary comparing the contents of the source folder and the expanded folder
test_archive_integrity(){
echo "Testing download of $EMULATOR.tgz"
mkdir -p "$TESTDIR"
@@ -78,8 +125,9 @@ test_archive_integrity(){
fi
}


# main loop
while [ $retry_count -lt $RETRY_LIMIT ]; do
create_archive
upload_file
if test_archive_integrity; then
echo "File integrity verified successfully."
211 changes: 211 additions & 0 deletions troubleshooting.md
@@ -0,0 +1,211 @@
# Troubleshooting guide

This guide will help you resolve the most common issues when running ITS on the PiDP-10.
Issues covered so far:

* [COMSAT is crashing on start](#comsat-is-crashing-on-start)

through out the document you will see certain special characters mentioned. They are slightly different depending on your keyboard and terminal
Member

Throughout is one word.

* `$` means the ALT or ESC key
Member

I think it is dangerous to mention ALT. The ALT key on many keyboards is a modifier key that must be held down while another key is pressed. This is not the same as ESC, which is a key intended to be pressed on its own, or optionally with a modifier key.

Suggest removing ALT here as people will try to use it in place of ESC to no effect.

* `^` means the CTRL or STRG key
* `<escape>` means the ALT or ESC key in EMACS
Member

Same comment regarding ALT as above.

* `<control>` means the Control key in EMACS
Member

Not just in emacs. Everywhere on ITS, including DDT.


## COMSAT is crashing on start
Thanks to eswenson for the intial steps here
Member

Initial is spelled wrong.


In order for INQUIR entries to stick, you must have COMSAT running.

If you run PEEK, you should see two COMSAT jobs. One has the JNAME IV and the other JOB.nn. If these jobs are not present, then COMSAT may have started and died, or not started at all.
```
*:peek
KA ITS 1651 Peek 631 8/14/2024 09:59:35 Up time = 5:05
Memory: Free=457 Runnable Total=11 Out=3 Users: High=13 Runnable=0
Index Uname Jname Sname Status TTY Core Out %Time Time PIs
0 SYS SYS SYS HANG ? 71 0 0% 1
1 CORE JOB CORE UUO ? 0 0 0%
2 MIKEK HACTRN MIKEK HANG > 30 9 0%
12 MIKEK PEEK MIKEK +TTYBO T52 C 11 2 0%
3 .BATCH BATCHN .BATCH SLEEP ? 126 23 0%
4 TARAKA CNAVRL CNAVRL 10!0 ? DSN 29 0 0% .VALUE
5 GUNNER GUNNER GUNNER _SLEEP ? 11 3 0%
6 TARAKA PAPSAV PAPSAV HANG ? 1 0 0%
7 TARAKA NAMDRG NAMDRG HANG ? 29 0 0%
10 PFTHMG DRAGON PFTHMG HANG ? 6 0 0%
11 TARAKA JOB.07 SYS HANG ? 3 0 0%
Fair Share 99% Totals: 317 0% 1
Logout time = Lost 0% Idle 98% Null time = 5:07
```
As you can see above none of the COMSAT processes are running.
Member

The term is “job” on ITS, not “process”.


There are several reasons why COMSAT may die upon startup. The most common are:
Let's start going through those one by one:
Member

A little weird to say the first sentence here, followed by a colon. Since the second sentence also has a colon.


### Network parameters for COMSAT are not correct.
When you bring up KA ITS, you'll see a message on the operator console like this:

 LOGIN  TARAKA 0 12:09:11
TOP LEVEL INTERRUPT 200 DETACHED JOB # 4, USR:COMSAT IV     12:09:12

This means that COMSAT has crashed.

If you look at the IP address that COMSAT is configured with:
```
comsat$j
$l .mail.;comsat launch
bughst/'NEW$:   SHOWQ+50,,PAT+6   =30052000544
```

you'll note that that octal address is: 192.168.1.100

Member

I recommend you insert my writeup on how to decode those octal IP addresses and convert them to standard octet notation. See recent post to PiDP10 mailing list.

If you look at the value that ITS has for the machine's IP address:

```
sys$j!
*impus3=1200600006
```

You'll see that that octal address is: 10.3.0.6
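
The two octal words shown above can be decoded by hand, but a small helper on the host side makes the check quicker. The following is a minimal sketch, assuming a POSIX shell on the machine you run the emulator from; `octal_to_ip` is a hypothetical name, not part of ITS or of this repository. It treats the low 32 bits of the word as the address, which is consistent with both values quoted in this guide.

```shell
#!/bin/sh
# Hypothetical helper: convert an octal word, as printed by DDT,
# into dotted-quad notation.
octal_to_ip() {
    # A leading 0 makes the shell read the digits as an octal constant.
    word=$(( 0$1 ))
    # Each octet is one 8-bit slice of the low 32 bits.
    printf '%d.%d.%d.%d\n' \
        $(( (word >> 24) & 255 )) \
        $(( (word >> 16) & 255 )) \
        $(( (word >> 8) & 255 )) \
        $(( word & 255 ))
}

octal_to_ip 30052000544   # COMSAT's value  -> 192.168.1.100
octal_to_ip 1200600006    # ITS's value     -> 10.3.0.6
```

If the two results differ, as they do here, COMSAT and ITS disagree about the machine's address.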

And if you look at the host table (SYSHST;H3TEXT >), you'll find an entry like this:
```
HOST : CHAOS 177002, 192.168.1.100 : DB-ITS.EXAMPLE.COM, DB : PDP-10 : ITS : :
```

(And there is no HOST entry for a machine with the name KA).

The easiest fix is to:

1) fix the host table
2) fix COMSAT's variables
3) generate COMSAT's database files
4) fix COMSAT's mailing lists file
5) restart COMSAT

To fix the host table, change the line:
```
HOST : CHAOS 177002, 192.168.1.100 : DB-ITS.EXAMPLE.COM, DB : PDP-10 : ITS : :
```
to
```
HOST : CHAOS 177002, 10.3.0.6 : KA : PDP-10 : ITS : :
```
Save the updated `SYSHST;H3TEXT >` and then compile the host table:
```
:SYSHST;H3MAKE
```
Make sure that there were no errors (look for an `H3ERR` file) and make
sure that there exists a file `SYSBIN;HOSTS3 NNNNNN` where `NNNNNN` matches
the `FN2` of the `SYSHST;H3TEXT NNNNNN` you just created.

Now your host table matches your ITS IP address.

Next, you need to fix COMSAT.

To do that, create a job for COMSAT:
```
comsat$j
```
Then load in the compiled (but not dumped) binary for COMSAT:
```
$l .mail.;comsat bin
```
And now set various variables:
```
BUGHST/1200600006
DEBUG/0
xvers/0
```
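
The value typed at `BUGHST/` above is the machine's IP address written as a single octal word. If your address is not 10.3.0.6, you need the octal form of your own address. The following is a minimal sketch, assuming a POSIX shell on the host side and octets written without leading zeros; `ip_to_octal` is a hypothetical helper, not something provided by ITS or this repository.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted-quad IPv4 address into the
# octal word that is typed into DDT.
ip_to_octal() {
    # Split the argument on dots into the four positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # Pack the four octets into one 32-bit value and print it in octal.
    printf '%o\n' $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

ip_to_octal 10.3.0.6        # -> 1200600006
ip_to_octal 192.168.1.100   # -> 30052000544
```

The first result matches the `BUGHST` value used in this guide, and the second matches the value found in the original `COMSAT LAUNCH` binary.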
And then purify the binary:
```
purify$g
```
and when DDT prints out:
```
:PDUMP DSK:.MAIL.;COMSAT LAUNCH
```
Type an `<enter>` to confirm.

Now, you have a correct `.MAIL.;COMSAT LAUNCH` executable.  This will be
launched by `TARAKA` on startup, or by `:MAIL` when invoked if `COMSAT` isn't
running.

However, before you do this, you need to make sure that COMSAT's database
files are created.

To do that, do this:
```
comsat$j
$l .mail.;comsat launch
debug/-1
mfinit$g
```
You should see a message like:
```
:$ File Directory Initialization successfully completed...
Proceeding will launch Comsat. $
*
```
Don't proceed the COMSAT job, because it will be run as your
UNAME rather than COMSAT's.  Simply kill the COMSAT job:
```
:kill
```
Now, there is one last step.  The file `.MAIL.;NAMES >` has entries
for DB (ITS) rather than KA.  It needs updating.

In emacs, open up `.mail.;names >` and do a query replace of all instances of DB
with KA.

To do that, enter the Query Replace command:
```
<escape>%
```
The echo area should display:
```
MM Query Replace$
```
Type in `DB<escape>KA<escape><escape>`

Your cursor will be positioned at the first instance of the string DB.

Type in

`!`
Yes, just the exclamation point character.  This will replace all instances
of `DB` with `KA`.

Save the file (`<control>x<control>s`) and return to DDT (`<control>x<control>c`).

Now, you are ready to launch `COMSAT`.

But first, make sure there is no (dead) COMSAT running by running `peek^k`.

Look for any job with the UNAME COMSAT (and the JNAME IV).  If you find one,
kill the job by typing:

`<job number>X`

Then, exit PEEK with the "q" command.

Now, send yourself a message:
```
:MAIL <your-uname>
<some message>
<control>c
```
You should see the message:
```
C Communications satellite apparently dead.
Re-launching, hang on... now in orbit!
```
Now, COMSAT should be running.  You can check with PEEK.

You also should see that your mail was delivered. Type:
```
:PRMAIL<enter>
```
to read (and optionally delete) it.


