FreeS/WAN -- KLIPS HARDWARE ACCELERATION NOTES (draft 3)
=============================================================================
Wed May 2 20:19:18 EDT 2001

This is a work in progress; if you are interested in how it progressed, go
and read the previous revisions from http://www.jukie.net/~bart/linux-ipsec/

In this revision I have given more consideration to the actual objects
involved in parts of the proposed KLIPS2. At this time the linux-ipsec
mailing list is picking up momentum in the topics of KLIPS2 development...
I hope to get some good feedback on the ideas within.

== KLIPS IN PARTS ==

So the first thing to mention is that KLIPS will be split. The parts that
will be created will better define the different jobs that KLIPS2 performs.
Each separate part will have a well defined interface - this will allow for
independent updates of each without disturbing the rest. The parts would be:

    1) tunnel processing engine (tunnel database and pfkey interface)
    2) protocol processing engine (ESP/AH packet mangling)
    3) crypto processing engine (crypto functionality)

The tunnel processing engine will tie into the IP-stack using netfilter; the
first operation done here is to match the skb to an SA. (This part is not
considered in detail in this document - I leave that for RGB who has devoted
considerable effort to it already.)

The protocol processing engine will receive a tunnel descriptor and an skb
for processing from the tunnel processing engine. In turn the protocol
processing engine will rely on the crypto functionality to do the - drum
roll please - crypto operations.

== CRYPTO ENGINE ==

The crypto engine contains functions like enc_3des, dec_3des, sign_sha1,
verify_sha1, compress_ipcomp, decompress_ipcomp, etc. KLIPS comes with
software implementations of these functions in the crypto engine. The
interface to each of the functions must be static/consistent and flexible
enough to easily embrace new protocols.

Another requirement of the above mentioned functions is for them to return
two types of success states along with a variety of error conditions. The
latter are always negative. The two success conditions are COMPLETED and
DETACHED. COMPLETED means that the operation was finished successfully, most
likely in software, while DETACHED identifies those cases where a separate
context of execution was used to complete the task. The DETACHED return is
needed to implement a system that can schedule tasks in a dedicated hardware
processor (yes, the idea is borrowed from netfilter's NF_STOLEN). Of course
the context must at some point return to the crypto engine via some
callback. This is handled by the protocol engine; read on.

A crypto engine consists of a data-structure with some general info about
the algorithm and some function pointers to create context (key) and to
perform the work (encrypt/hash/compress/etc). The following is an example of
such a structure:

    struct alg_engine {
        char * name;             /* due to the lack of my creativity */
        int block_size;          /* these fields are borrowed from   */
        int iv_size;             /* the international patch          */
        int key_schedule_size;
        u32 key_size_mask;
        u32 jobsize;             /* buffer required to hold all of the
                                  * context information per packet */
        struct engine_ops {
            int (*create_instance) (...);
            int (*delete_instance) (...);
        };
        struct instance_ops {
            int (*forward) (...);    /* encrypt/sign/compress */
            int (*reverse) (...);    /* decrypt/verify/decompress */
        };
    };

The idea is that the above structure is created once for each of the
algorithms supported in software (duplicated if we have hardware support).
When an algorithm is requested by something, the engine->create_instance()
function is called to generate an instance. The instance stores the key and
the current state of the transform (like the RC4 key that changes over time,
or the key schedule for 3DES, etc).
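The return convention described above (two non-negative success states,
negative errors) can be sketched in a few lines of C. This is only an
illustration under stated assumptions: the names alg_status,
alg_engine_sketch, sw_forward, hw_forward and alg_run are hypothetical, the
per-key instance is left as an opaque pointer, and the "cipher" is a
one-byte XOR placeholder standing in for a real transform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes: both success states are >= 0, errors < 0. */
enum alg_status {
    ALG_EINVAL    = -1,  /* example error condition */
    ALG_COMPLETED =  0,  /* work finished before the call returned */
    ALG_DETACHED  =  1   /* work queued elsewhere; a callback will follow */
};

/* Cut-down engine descriptor; ctx stands in for the per-key instance. */
struct alg_engine_sketch {
    const char *name;
    int (*forward)(void *ctx, const uint8_t *src, uint8_t *dst, size_t len);
};

/* Software stand-in: XOR with a one-byte key. NOT a real cipher. */
static int sw_forward(void *ctx, const uint8_t *src, uint8_t *dst, size_t len)
{
    size_t i;
    uint8_t key;

    if (!ctx || !src || !dst)
        return ALG_EINVAL;
    key = *(const uint8_t *)ctx;
    for (i = 0; i < len; i++)
        dst[i] = src[i] ^ key;
    return ALG_COMPLETED;
}

/* Hardware stand-in: only counts queued requests, then reports DETACHED. */
static unsigned hw_queued;

static int hw_forward(void *ctx, const uint8_t *src, uint8_t *dst, size_t len)
{
    (void)len;
    if (!ctx || !src || !dst)
        return ALG_EINVAL;
    hw_queued++;            /* a real driver would program the device here */
    return ALG_DETACHED;
}

static const struct alg_engine_sketch sw_engine = { "xor-sw", sw_forward };
static const struct alg_engine_sketch hw_engine = { "xor-hw", hw_forward };

/* Caller-side check: returns 1 when dst is usable right away, 0 when the
 * caller must wait for the callback, and the negative code on error. */
static int alg_run(const struct alg_engine_sketch *e, void *ctx,
                   const uint8_t *src, uint8_t *dst, size_t len)
{
    int rc = e->forward(ctx, src, dst, len);

    assert(rc <= ALG_DETACHED);     /* no other success states exist */
    if (rc < 0)
        return rc;
    return rc == ALG_COMPLETED;
}
```

The caller never needs to know whether an engine is software or hardware;
it only looks at the sign of the return value and, for success, at which of
the two states came back.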

Here is an example instance structure:

    struct alg_instance {
        u32 key_len;
        u8 *key_data;
        u8 iv[MAX_IV_SIZE];
        struct alg_engine *imp;  /* algorithm implementation pointer */
    };

You may have noticed that the above structures seem very similar and
familiar. I personally like the kerneli design and would like to capture a
lot of it in this layer in the hopes that in the future it can be replaced
with kerneli. The single issue against using kerneli in KLIPS2 is that the
user will be _forced_ to recompile the kernel - i.e. no way of providing a
fully modular distribution of KLIPS for known kernels (like the one that
comes with your favorite Linux distro).

However, there is one item I don't like in the kerneli design. Kerneli
provides a hook for software and hardware fn's for each cipher. What if you
have two different types of hardware that support MD5 or 3DES? There is no
nice way of adding that in. In my mind it should be possible to attach
multiple implementations of one cipher.

The above fits well for encryption, decryption, compression and
decompression. Now, a note on split hashing functions and hardware
accelerators.

Currently hashing is done through a single call to (MD5|SHA1)Init, then many
calls to (MD5|SHA1)Update followed by a single call to (MD5|SHA1)Final. The
first call is made during the keying or re-keying of a tunnel; the second
call is used for each successive update of the hash context with the data
passed in; the final call is used to return the hash value and reset the
context. Now during a normal packet-processing operation there will be
approximately 5 Update calls using AH and 2 Update calls using ESP (both use
2 Final calls). Most of these calls move the hash by a small number of bytes
and then quit... this would be horrible for hardware accelerators where you
are burned on a large number of calls (the gain of hw acceleration comes
from doing a small number of large bulk operations, not vice versa).
My question is: would it not be possible to consolidate all these calls into
one call that will do everything and just return the finished hash?

If it is possible for all the hashing for one ESP/AH packet to be done by
one operation then there need only be one function to process the whole bulk
of the packet. If the contrary is true then we need Update and Final
functions, where Update just queues up the data and Final flushes the buffer
and sends it to hardware. This of course would be hidden from the user -
different implementations would use different minimum buffer sizes before
flushing the processing to the hardware.

As a final note: when creating a context in hardware there may not always be
sufficient context storage space on the device (some devices are limited to
storing, say, 1k sessions concurrently). The API to the chip would fail the
insertion of the key if there was no more room in the device. The hooks are
not allowed to fail in this way. Instead, the hook should cache the current
sessions and, if a new session creation fails in the chip, remove the least
recently used one and try the insertion again.

== PROTOCOL ENGINE ==

This part of KLIPS consists of two protocol engines; one for ESP and another
for AH. Each engine has two entry points: inbound and outbound. And, once
again, the objective is to abstract a protocol handler from being a specific
AH or ESP one. When a protocol handler's function is called it's because a
specific tunnel was configured to use it.

When creating a protocol handler you must specify the algorithms to use and
their properties (i.e. keys, etc). The init_esp and init_ah functions will
in turn call the create_instance functions in the appropriate crypto engine
defined by the algorithms specified (see above). It would also be possible
to specify if the algorithm should be a software one or a hardware one.
As will be shown later, with a simple policy decision it will be possible to
configure a connection to use a specific implementation of a protocol
engine. More on this later.

If we have a hardware accelerator that is capable of doing protocol
processing the hardware module would register a hook for an ESP or AH
handler (or both). This is similar to the way an ethernet driver works (or
serial, or parallel for that matter).

Thus, a protocol engine consists of a data-structure with a description of
the protocol and a collection of function pointers to create a context (SA)
and process packets (AH/ESP). The following is an example of such a
structure:

    struct protocol_engine {
        char * name;             /* due to the lack of my creativity */
        u32 jobsize;             /* buffer required to hold all of the
                                  * context information per packet */
        /* TBD */
        struct engine_ops {
            int (*create_instance) (...);
            int (*delete_instance) (...);
        };
        struct instance_ops {
            int (*outbound) (...);   /* encapsulate */
            int (*inbound) (...);    /* decapsulate */
        };
    };

The above structure is created once for each instance of a protocol engine;
initially KLIPS2 would come with software implementations of ESP, AH and
IPCOMP. However a hook interface will be provided to "register" 3rd party
protocol engines.

As in the case of algorithms, the engine can create instances which
represent each SA. A security association (SA) contains the necessary
algorithm instances (and in turn keys) for processing an skb. Here is an
example of a protocol instance structure:

    struct protocol_instance {
        u32 sa_id, pid, replaywin, flags, etc, etc, etc;
        struct alg_instance **transforms;  /* what to apply in order */
        struct protocol_engine *imp;       /* protocol implementation pointer */
    };

During configuration of a tunnel the above structure will be created through
the PFKEY2 interface. The data passed will contain keying information that
will be passed to, say, the ESP->create_instance() function.
This function will in turn create the appropriate algorithm instance
structures and store them in the protocol_instance.transforms array. When an
skb arrives at, say, the ESP->outbound() function the transforms are
executed in sequence and applied to the skb.

Chaining of protocol engine instances should not be handled by the protocol
instances themselves but by the caller of the engine, so in the tunnel
processor, to be discussed shortly, the call sequence may be:

    upon getting an skb from netfilter on the LOCAL_OUT hook
        get first protocol instance (pi)
        while(pi)
            pi->outbound(skb)
            get next protocol instance for this packet
        end
        reinject the skb into the netfilter stream

== TUNNEL ENGINE ==

This engine will handle all of the PFKey interface as well as the interface
to netfilter which is essential for the actual packet mangling. The
processing of skb's will be moved to the protocol engine as described above.

Currently in the KLIPS code, the ipsec_rcv and ipsec_tunnel functions lock
the tdb entry while working on that SA (this may have changed recently).
This is OK for a single processor implementation running only in software.
However in some instances the tdb may be used by multiple packets
concurrently. So this lock will have to go away in a KLIPS that is hardware
accelerated (or even wants to use two CPUs for a single but very-very-busy
tunnel). I recommend using 'use count' and 'delete bit', as outlined in
Rusty's Unreliable Locking Guide, to prevent a tdb from being deleted before
its time is up.

== PACKET PATH IN SOFTWARE ==

This section describes what happens when a packet arrives. The tunnel engine
(with help from netfilter) locates the appropriate protocol engine
instance(s) that were configured for a specific connection. It then calls
the outbound or inbound function that is appropriate for the orientation of
the skb. The protocol engine calls the algorithm function(s) (using pointers
that were set during SA creation).
On the completion of all the transform functions the packet is returned to
the tunnel processor which may decide that another tunnel is appropriate for
this packet (as was described above) and the cycle continues.

Below is the above in a diagram (for the software case); NOTE that I am
using ESP with 3DES and MD5 as an example and there is no reason other
algorithms could not take their place.

  [tunnel processing]       [protocol processing]     [crypto processing]

  PLUTO
  |
  | tdb_init(tdb, ...)
  |------------>|
  |             | ESP->create_instance
  |             |   (spi, ENC_3DES, e_key, e_len,
  |             |    AUTH_MD5, a_key, a_len)
  |             |-------------------->|
  |             |                     | 3DES->create_instance
  |             |                     |   (flags, key, len)
  |             |                     |-------------------->|
  |             |                     |esp->transform[0] <--|
  |             |                     |
  |             |                     | MD5->create_instance
  |             |                     |   (flags, key, len)
  |             |                     |-------------------->|
  |             |                     |esp->transform[1] <--|
  |             |tdb->pi <------------|
  |rc <---------|

  NIC
  | ipsec_rcv(skb)
  |
  | tdb_lookup(spi,...)
  |------------>|
  |tdb <--------|
  |
  |job = kalloc(tdb->pi->jobsize)
  |job->skb = skb;
  |job->tdb = tdb;
  | tdb->pi->imp->inbound
  |   (job)
  |---------------------------------->|
  |                                   |esp=job->tdb->pi;
  |                                   |
  |                                   | esp->transform[0]->imp->forward
  |                                   |   (esp->transform[0],
  |                                   |    skb+src_ofs,
  |                                   |    src_len,
  |                                   |    skb+dst_ofs,
  |                                   |    dst_len)
  |                                   |-------------------->|
  |                                   |rc <-----------------|
  |                                   |
  |                                   | esp->transform[1]->imp->forward
  |                                   |   (esp->transform[1],
  |                                   |    skb+src_ofs,
  |                                   |    src_len,
  |                                   |    skb+dst_ofs,
  |                                   |    dst_len)
  |                                   |-------------------->|
  |                                   |rc <-----------------|
  |rc <-------------------------------|
  |
  | (loop if unmangled packet is an ESP/AH/IPCOMP)
  |
  | reinject(skb)

Anyway, you get the picture.

Here are some finer points:

    * ipsec_rcv will loop for nested tunnels/SA's
    * ipsec_tunnel will be similar to ipsec_rcv
    * tdb->pi points to a 'struct protocol_instance'
    * tdb->pi->imp points to a 'struct protocol_engine'
    * esp is the tdb->pi passed into the inbound() function
    * the transforms are 3DES and MD5 implementations

The point is that each layer makes no guesses about how the next one works.
For example we know that des encrypt and des decrypt are the same operation
with a different key schedule. However, we do not use this and have two des
functions. The advantage of this is that if a new algorithm is implemented
and some trick cannot be used again, the layer before does not have to
change. That is to say, the protocol layer should not depend on specific
implementation hacks of the algorithm layer.

== PACKET PATH IN HARDWARE ==

The major issue about hardware acceleration is this: since most of KLIPS is
running in bottom-half time (initiated by the ISR of the NIC driver) it
cannot sleep and wait for a device to complete its computations. Even if it
could we would not want to do this since we have better things to do; like
servicing other user tasks, processing the next packet, searching for
aliens, etc.

When dispatching the job to a crypto processor the IP stack needs to be
informed that the packet was stolen. This is quite important; if we just
returned from ipsec_rcv then the skb could have been deleted while we were
still working on it in parallel.

Here is my interpretation of what will happen if you add hardware to the
above diagram.

  [tunnel processing]       [protocol processing]     [crypto processing]

  NIC
  | ipsec_rcv(skb)
  |
  | tdb_lookup(spi,...)
  |------------>|
  |tdb <--------|
  |
  |job = kalloc(tdb->pi->jobsize)
  |job->skb = skb;
  |job->tdb = tdb;
  | tdb->pi->imp->inbound
  |   (job)
  |---------------------------------->|
  |                                   |esp=job->tdb->pi;
  |                                   |
  |                                   | esp->transform[0]->imp->forward
  |                                   |   (esp->transform[0],
  |                                   |    skb+src_ofs,
  |                                   |    src_len,
  |                                   |    skb+dst_ofs,
  |                                   |    dst_len)
  |                                   |-------------------->|
  |                                   |                     |----> dispatch H/W
  |                                   |rc <-----------------|
  |rc <-------------------------------|

  [ millions of nanoseconds later in code near by... ]

  H/W interrupt
  | ipsec_callback(job)
  |
  | tdb->pi->imp->inbound
  |   (job)
  |---------------------------------->|
  |                                   |esp=job->tdb->pi;
  |                                   |
  |                                   | esp->transform[1]->imp->forward
  |                                   |   (esp->transform[1],
  |                                   |    skb+src_ofs,
  |                                   |    src_len,
  |                                   |    skb+dst_ofs,
  |                                   |    dst_len)
  |                                   |-------------------->|
  |                                   |                     |----> dispatch H/W
  |                                   |rc <-----------------|
  |rc <-------------------------------|

  [ millions of nanoseconds later in code near by... ]

  H/W interrupt
  | ipsec_callback(job)
  |
  | tdb->pi->imp->inbound
  |   (job)
  |---------------------------------->|
  |rc <-------------------------------|
  |
  | (may need to repeat ipsec_rcv if unmangled packet is an ESP/AH)
  |
  | netif_rx(job->skb)
  | free_skb(job)

A bit more involved but pretty much the same idea. The difference is that
the continuity is broken twice; once to do encryption and once to do
authentication. Here is a list of notes for this diagram:

    * job is a buffer that stores the information about one transaction; it
      is used internally to store all local variables that need to be kept
      around between the dispatch of the operation and the matching
      interrupt.
    * jobsize is calculated ahead of time during SA creation and contains
      the number of bytes used in the protocol processor and crypto
      processor.

It is worth mentioning that there may be multiple jobs per tdb. This will
happen most frequently on the receiver, if the sender is much faster and can
swamp its counterpart. For this reason we keep a separate 'job' structure
for each packet that comes into KLIPS.
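The broken continuity described above can be sketched as a small resumable
loop. This is a minimal illustration under stated assumptions, not KLIPS
code: the job layout, the JOB_* codes and the names inbound_sketch,
ipsec_callback_sketch, sw_step and hw_step are all hypothetical, the
transforms do no real work, and the delivered flag merely stands in for
handing the skb back to the stack. The one idea it tries to show is that the
job remembers which transform comes next, so re-entering inbound from the
callback picks up where the dispatch left off.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes: success states >= 0, errors negative. */
enum { JOB_COMPLETED = 0, JOB_DETACHED = 1 };

struct job;
typedef int (*transform_fn)(struct job *job);

/* Per-packet state that must survive between dispatch and interrupt. */
struct job {
    const transform_fn *transforms; /* what to apply, in order */
    size_t count;                   /* number of transforms */
    size_t next;                    /* index of the next transform to run */
    unsigned steps_run;             /* bookkeeping for this sketch only */
    int delivered;                  /* stands in for netif_rx(job->skb) */
};

/* Software stand-in: finishes before returning. */
static int sw_step(struct job *job)
{
    job->steps_run++;
    return JOB_COMPLETED;
}

/* Hardware stand-in: pretends the work was queued on a device. */
static int hw_step(struct job *job)
{
    job->steps_run++;
    return JOB_DETACHED;
}

/* Runs transforms from job->next until one detaches, fails, or all are
 * done. The index is advanced before the call so that a later re-entry
 * continues with the following transform. */
static int inbound_sketch(struct job *job)
{
    assert(job->next <= job->count);
    while (job->next < job->count) {
        transform_fn fn = job->transforms[job->next++];
        int rc = fn(job);

        if (rc != JOB_COMPLETED)
            return rc;          /* DETACHED or a negative error */
    }
    job->delivered = 1;         /* every transform has been applied */
    return JOB_COMPLETED;
}

/* Called from the (simulated) hardware interrupt: just re-enter inbound. */
static int ipsec_callback_sketch(struct job *job)
{
    return inbound_sketch(job);
}
```

With a chain of hardware, software, hardware transforms, the first call
returns detached after one step, the first callback runs the software step
and detaches again on the third, and the second callback finds nothing left
to do and delivers the packet.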

== HARDWARE ACCELERATION POLICY ==

As briefly mentioned above you should be able to assign a crypto engine - be
it software or an instance of a hardware accelerator - to a connection or
even to an SA.

For example say you have a massive security gateway (SG) with a fair number
of static connections and a large number of possible road-warriors that may
connect. You feel that the static connections are more important so you
allocate a hardware engine to them but not to the %default connection. Thus
each road-warrior would get the default engine (i.e. software).

Say that in this box we have two accelerators named alpha and beta. Alpha is
capable of performing whole packet processing for esp only. Beta is only
able to perform 3des; a sample configuration could look like this:

    config setup
        hwload=alpha,beta
        ...

    conn %default
        doesp=software
        doah=software
        do3des=software
        domd5=software
        dosha1=software
        ...

    conn static-1
        doesp=alpha
        ...

    conn static-2
        do3des=beta
        ...

    conn roadwarrior
        ...

In the above example alpha and beta are names of drivers that reside in
/lib/modules/your.kernel.ver/kernel/net/ipsec alongside ipsec.o. Thus after
loading ipsec.o, but before setting up any tunnels, the setup script will
have to check the hwload variable and load the modules listed there. These
modules are most likely just wrappers around crypto drivers and will require
that the main driver be loaded as well - modutils takes care of that for us.

We have discussed signing crypto engines for loading into freeswan and
decided against it as loading them requires root privileges. If you have
root you can do it all anyway.

As always... comments and critique are most appreciated.

Regards,
Bart Trojanowski

=============================================================================
This document is a copyright of Bart Trojanowski. If available, an updated
version of this document can be found at:
http://www.jukie.net/~bart/linux-ipsec/