- Coding up a direct graph representation from the .seq traces of the program that pH is creating. - coding up the search algorithm. Wed May 2 00:13:31 PDT 2001 I'm currently running pH on a machine which does nothing more than run a screen session. Every once in a while i need to do something on the shell of the machine, and at this point all the commands that i use are delayed a lot as a result of my shell interactions not being part of the standard sequence lists. Wed May 2 11:39:58 PDT 2001 pH uses a bitfile to store the .seq traces. graph.c analyzes the bitfields and creates a matrix representation of a graph. For the STIDE data, I could do this with perl instead of C since the sequence data they provide is a text file. Thu May 3 20:24:49 PDT 2001 I'm looking at a couple of things at the moment. Going over docs from the STIDE paper looking at STIDE source, and looking at their wu-ftpd testing. The wu-ftp section (http://www.cs.unm.edu/~immsec/data/synth-ftp.html) uses STIDE to make a sequence file for wu-ftpd on linux. I'm going to guess (and check) that the mapping file is the same as the one used by pH on my system. Thu May 3 22:22:15 PDT 2001 Some system calls cannot be no-oped. For instance, any syscall that does not rely on its argument list to function is going to affect the process under attack. Fork(2) and _exit(2) are examples of this. Write function that takes in a syscall list, sequence list and creates a graph of the sequence list, and then is capable of querries. Sun May 6 01:45:03 PDT 2001 Wrote graphinit.c - code for taking a STIDE sequence and creating a struct unit that hold both the key and the graph (stored as a adjacency matrix) from the sequence files (using the ftp one). The struct is then saved to a savefile. Started work on pathfind.c - code to take the struct from the savefile and works with the semantics "% pathfind " - whereupon it should return the path (with all intermediary system calls) or say such a path is not possible. Mon May 28 09:12:43 PDT 2001 Fixes for compilation of stide code under gcc 2.95.2 ibm:~/stide/stide_v1.1>gcc -v Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.2/specs gcc version 2.95.2 20000220 (Debian GNU/Linux) error: seq_config.cc:284: ANSI C++ forbids using pointer to a function in arithmetic fix: exit -1 should be exit (-1) error: In file included from template.cc:1: ../Utils/arrays.cc: In method `void CompSortableArray::Sort()': ../Utils/arrays.cc:113: name lookup of `i' changed for new ANSI `for' scoping ../Utils/arrays.cc:109: using obsolete binding at `i' fix: (line 113 should be:) for (int i = sz-1; i >= 1; i--) { error: flexitree.cc:318: ANSI C++ forbids declaration `IsSeqInForest' with no type fix: since this returns 0, it should have int as its type for return value. int SeqForest::IsSeqInForest(const Array &seq, int seq_len) const Magic templates. they link with template.cc which creates all their templated classes. hence they split up the template headers and the template declarations. Fri Jun 1 11:51:10 PDT 2001 Notes for stide code: Stide analyzes serveral streams at once as the stide binary takes in (pid, syscall) paird on stdin. Each unique (pid, *) is a stream. Each stream is broken up into UNIQUE sequences of system calls of some length k. these are stored in a database, as trees, with the root of the tree being the first system call in the sequence. When a string of system calls is checked for anomality; the string's first system call is matched with the root of a sequence in the database. Then the rest of the trace is compared to get the minimum Hamming Distance d(i,j) between i and all j in the database, where j is a sequence. Hamming distance: given 2 bitstrings of same length, the hamming distance between them is the # of bits that differ - this generalizes to "pitstrings": ie strings of digits in p-ary variables in main: cfg: configuration object active_stream: ptr to the current "process stream" sid_table: external stream ids -> internal stream ids normal: SeqForest (unitialized forest of normal sequences) streams: array of stream objects one for each process stream. num_streams_fnd: number of process streams encountered to date total_pairs_read: number of (pid, syscall) pairs read from input from all the datastreams combined (can be offset with -n) int db_size: number of unique sequences in the database specified with -d 1st thing main does is read the cfg.db_name, and the seq len. seq len is very important since it wont know how to chop up the trees. reads it into normal. in the ReadDB function we have: while (!in_db_file.eof()) { if (root == -1) break; db_forest.trees_found[root]++; // add one more to the number of trees that stem from this syscall in_db_file>>db_forest.trees[root]; // shove the in_db_file into the seqforest. db_size += db_forest.trees[root].NumLeaves(); //increment db_size in_db_file>>root; //go on to the next number } where trees_found is a array of int. where db_forest.trees is an array of Flexitrees this processes something of the type: 3 3 4 2 9 10 3 -4 3 9 8 -2 3 -3 4 9 -1 2 2 3 4 5 6 7 -3 2 9 -1 GetReadyStream returns a Stream which then is placed in active_stream stide then calls active_stream->CompareSeq Each "stream" is a class. A stream can tell if it is anomalous with CompareSeq (which calls ComputeMisses). If ComputeMisses returns with 0 then the seq has been found in the database. Need to implement: Read in the mydb.db file using the ReadDB function to have the seqforest - then we can build up a Stream* attackstream with our syscalls padded in attackstream->current_seq; and we can call if (attackstream->CompareSeq(...,normal,0)==0) { cout << "success"; } Sun Jun 10 23:26:36 PDT 2001 Implemented the access matrix version of the graph but it is wrong, as it does not have a concept of state. For example, if the 2 trees are stored: 1 2 3 7 8 3 4 5 6 7 the graph will report back that the sequence Z: (1 2 3 4 5 6 7 8) is possible, even tho most subsequences of length 5 from Z are not going to match the trees stored. 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 <- the only match. 4 5 6 7 8 need to store state. also, the algorithm to just decompose an exploit thread as a series of heads of flexitrees. For example, lets have sequences an exploit "6 1 4" stream: 4 1 3 5 2 7 6 6 7 breaks down to: 4 1 3 1 3 5 3 5 2 5 2 7 2 7 6 7 6 6 6 6 7 lets try the exploit stream 6 (6 7) 1 (3 5) 4 (1 3) this would decompose and be checked by: 6 6 7 6 7 1 * 7 1 3 * 1 3 5 3 5 4 * 5 4 1 * 4 1 3 where * doesnt match anything in the database.