FreeS/WAN -- KLIPS HARDWARE ACCELERATION NOTES (draft 4)
=============================================================================
$Id: freeswan-hardware-acceleration-draft-4.txt,v 1.2 2001/09/10 15:41:29 bart Exp $


0. INTRODUCTION

0.1 A bit about the history of this document

This is a work in progress; if you are interested in how it progressed, go
and read the previous revisions at http://www.jukie.net/~bart/linux-ipsec/

This revision takes into consideration the design and development that has
been done on the sister project, Generic Engines, which is located at
http://www.jukie.net/~bart/genericengine/. Generic Engines will be used as
a framework for the engines discussed in this document.

It is assumed in this document that you have some idea what a Generic
Engine (GE) is. You may wish to read the URL just mentioned for
clarification.

I have met with Richard Briggs and Michael Richardson and their comments
were positive.

0.2 Adventures in kerneli

You may notice that the structures and concepts seem very similar and
familiar as you read this document. I personally like the kerneli design
and would like to capture a lot of it in this layer, in the hope that in
the future it can be replaced with kerneli, or the opposite: that GE can be
a replacement for kerneli.

One of the reasons for not using kerneli in KLIPS2 is that the user would
be _forced_ to recompile the kernel - i.e. there would be no way of
providing a totally modular distribution of KLIPS for known kernels (like
the one that comes with your favorite Linux distro). This is a small
problem, as I have demonstrated that kerneli can be made into a stand-alone
module; see http://www.jukie.net/~bart/linux-ipsec/kerneli-module/ for
details.

One thing I don't like about kerneli is this: kerneli provides a hook for
software and hardware functions for each cipher. What if you have two
different types of hardware that support MD5 or 3DES? There is no nice way
of adding that in.
In my mind it should be possible to attach multiple implementations of one
cipher.

Finally, it will be impossible for KLIPS2 to use kerneli for its hardware
cipher database. The reason is that while kerneli supports
hardware/asynchronous cipher engines, it maintains a synchronous calling
convention. That is, the caller is forced to sleep (on a wait queue) while
the processing occurs. This is fine if you are servicing user space
requests like disk reads and writes. However, KLIPS operates in the
interrupt domain 99% of the time. It is thus impossible to put the calling
task on a wait queue -- you cannot sleep in an interrupt.


1. KLIPS IN PARTS

So the first thing to mention is that KLIPS will be split. The parts that
will be created will better define the different jobs that KLIPS2 performs.
Each separate part will have a well-defined interface - this will allow for
independent updates of each without disturbing the rest. The parts would
be:

  a) tunnel processing engine (tunnel database and pfkey interface)
  b) protocol processing engine (ESP/AH packet mangling)
  c) crypto processing engine (crypto functionality)

The tunnel processing engine will tie into the IP stack using netfilter;
the first operation done here is to match the skb to an SA. (This part is
not considered in detail in this document - I leave that for Richard and
Michael, who have devoted considerable effort to it already.) Please see
http://www.sandelman.ottawa.on.ca/SSW/freeswan/klips2req/ for more details
on the overall KLIPS redesign plan maintained by Michael Richardson.

Both the protocol processing engine and the crypto processing engine will
be based on the Generic Engines infrastructure. To find out more about this
project visit http://www.jukie.net/~bart/genericengine/.

In brief, the GE infrastructure provides KLIPS2 with a database of protocol
and cryptographic algorithms which can be used in a transparent manner.
The specifics of this are discussed in this document; it is expected that
you, the reader, have been exposed to the design documents for GEs before
reading this document.


2. CRYPTO ENGINES

The cryptographic algorithm engines are the simplest to implement, as in
most cases the GE becomes a wrapper for existing code. For example,
implementing a DES crypto engine involves wrapping the key schedule in the
GE context structure definition and writing a few initialization functions.

The crypto engines will provide KLIPS2 with encryption and decryption
functions for triple-DES and AES (initially -- more may follow), as well as
digest functions for MD5 and SHA-1 (again, more may follow).

2.1 Overview of a crypto engine

When KLIPS2 receives a request from Pluto (via the pfkey v2 interface) to
construct a new tunnel it will create the appropriate crypto and digest
contexts. For example, if the ESP tunnel required a triple-DES cipher to
encrypt packets then the following code would be called to initialize the
GE context to process packets:

    struct generic_engine *engine;
    struct generic_context *context;
    struct cipher_des_ede3_context_definition defn;

    engine = get_engine_by_name("cipher-des_ede3");

    defn.operation = ENCRYPT_CBC;
    defn.key_ptr = key;
    defn.key_len = sizeof(key);
    defn.iv_ptr = iv;
    defn.iv_len = sizeof(iv);

    create_context( engine, &defn, GFP_ATOMIC, &context);

At this time the 'engine' pointer is no longer required; it may thus be
returned/freed by calling put_engine as shown below. By creating the
context object the reference count of the engine was increased; this
prevents the engine from being deleted (by doing an rmmod on the module
which defines the engine, for example).
Since the engine code is safely held in the kernel the 'engine' pointer is
no longer needed, and thus:

    put_engine( engine );

Given that 'create_context' succeeds, the 'context' variable will point to
a GE context that can be used for future processing of data to be encrypted
by the 'cipher-des_ede3' engine. The code which will perform the actual
processing of packets will look something like this:

    create_job( context, GFP_ATOMIC, &job );

    job->data.in_ptr = plain_text;
    job->data.in_len = sizeof(plain_text);
    job->data.out_ptr = cipher_text;
    job->data.out_len = sizeof(cipher_text);
    job->callback = callback_function;
    job->opaque = context;

    execute_job( job );

Given that 'execute_job' succeeds, the task defined by the 'job' structure
(i.e. encrypt plain_text into cipher_text) will be executed asynchronously
by the des_ede3 GE. One thing must be stressed at this point: after
executing the job, the caller which initiated the execution cannot know the
state of the job. It must wait for the callback before it can determine
whether the job execution succeeded or failed.

Upon the successful, or failed, completion of this operation the
callback_function will be invoked. At this time the 'result' attribute of
the 'job' structure contains the success or failure state of the operation.
The callback must clean up the job structure if it does not choose to use
it again; the following code shows this.

    if( job->result )
        handle_failure();
    else
        handle_success();

    release_job( &job );

Finally, when the context has outlived its usefulness it will be removed
from the system by calling release_context. Releasing of a context will not
occur until all references to this context are released as well. This
includes any jobs that may have been created/executed on the context. While
release_context will not block, the memory occupied by the object will
remain until the reference count is ZERO. Only the module that calls
create_context should ever call release_context explicitly.

    release_context( &context );

NOTE: When freeing the engine we called a put_ function. When freeing the
job and the context the function prefix was release_. In the GE paradigm
put_ is used to identify a function which only decrements the reference
count; similarly release_ will decrement the reference count, but if the
count drops to ZERO the object will also be deleted from memory. It is
important not to use put_job and put_context directly, although they exist,
as they will not delete the objects that were created by create_job and
create_context, respectively.

NOTE also: All functions seen above, with the exception of
get_engine_by_name, return a negative value on error and a non-negative
value (including ZERO) to indicate success -- as per UNIX tradition. In
some cases a positive vs a zero return may indicate a detached or completed
task. See engine.h, part of the generic engine distribution, for details on
these return codes.

2.2 Configuration specifics

All generic engines are grouped into categories. Once again the
concentration is on one engine: the software implementation of triple DES.
This engine is registered with the GE database with a key of
"cipher-des_ede3-sw" and is part of the "cipher-des_ede3" family, or group,
of engines. All engines which are part of the same group perform the same
functionality and thus can be replaced by other implementations of the same
group.

A group of generic engines is thus defined as a set of engines, each of
which is transparently replaceable with another within the group. Each
engine of the same group is named such that the group name is the prefix of
the engine name, followed by a unique identifier of the engine.
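
The naming rule above can be sketched as a simple prefix test. This is only
an illustration: engine_in_group() is a hypothetical helper and not part of
the Generic Engines API, and the '-' separator between the group name and
the identifier is an assumption read off the name "cipher-des_ede3-sw".

```c
#include <stdbool.h>
#include <string.h>

/*
 * ASSUMPTION: an engine name has the form "<group>-<identifier>", with '-'
 * as the separator, as in "cipher-des_ede3-sw". engine_in_group() is a
 * hypothetical helper, not a function of the Generic Engines API.
 */
static bool engine_in_group(const char *engine_name, const char *group_name)
{
    size_t glen = strlen(group_name);

    /* The group name must be a prefix of the engine name... */
    if (strncmp(engine_name, group_name, glen) != 0)
        return false;

    /* ...followed by the separator and a non-empty identifier. */
    return engine_name[glen] == '-' && engine_name[glen + 1] != '\0';
}
```

Since an identifier may itself contain '-', a shorter prefix such as
"cipher" would also pass this test; a real database would presumably
compare against registered group names instead.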
For example, in the group "cipher-des_ede3" there could be the following
variants of the triple-DES cipher engine:

    cipher-des_ede3-sw          software implementation
    cipher-des_ede3-luna_vpn    hardware on Luna VPN card
    cipher-des_ede3-pro100s     hardware on Intel Pro100/S

When a context is created using a specific engine a definition structure is
used to initialize the context. The definition (defn) contains all
variables which could cause one context to behave differently from another
given similar data. In the case of the "cipher-des_ede3" group of engines
the definition structure is templated as follows:

    struct cipher_des_ede3_context_definition {
        enum { ENCRYPT_CBC, DECRYPT_CBC,
               ENCRYPT_ECB, DECRYPT_ECB } operation;
                            /* ENCRYPT or DECRYPT in CBC or ECB mode */
        u8 *key_ptr;        /* key for 3DES */
        u8  key_len;
        u8 *iv_ptr;         /* starting IV */
        u8  iv_len;
    };

This definition structure is passed to create_context. It is short-lived
and can be deleted from system memory as soon as create_context returns.
This will prevent keys from being held in memory for too long. If the
engine is exposing functionality of a hardware accelerator board (or chip)
then the key material may be stored in a protected manner in the chip. By
deleting/clearing the definition structure after calling create_context the
only copy of the key in system memory will be deleted, improving overall
security.

As a trait of the generic engine group definition, any group of engines
will have a shared C header file which uses the name of the group followed
by a .h suffix as the file name. Thus for the triple DES cipher the
configuration data is stored in "cipher-des_ede3.h", which at the very
least defines "struct cipher_des_ede3_context_definition". The key
schedule, for example, is very implementation dependent -- recall the key
schedule can be buried in a hardware device -- and thus is not part of any
common structure. You can consider the defn structure to be a seed for the
algorithm's operation.
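
The short life of the definition structure can be sketched in user-space C
as follows. This is a minimal sketch under stated assumptions:
demo_context, demo_create_context(), wipe() and setup_and_wipe() are
hypothetical stand-ins invented for the illustration (the real
create_context takes an engine and an allocation flag and is not
reproduced here), and the context is assumed to keep its own copy of
whatever key material it needs.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char u8;

/* Mirrors the definition structure given in this section. */
struct cipher_des_ede3_context_definition {
    enum { ENCRYPT_CBC, DECRYPT_CBC, ENCRYPT_ECB, DECRYPT_ECB } operation;
    u8 *key_ptr;
    u8  key_len;
    u8 *iv_ptr;
    u8  iv_len;
};

/* HYPOTHETICAL context: it keeps a private copy of the key. */
struct demo_context {
    u8 key[24];
    u8 key_len;
};

/* HYPOTHETICAL stand-in for create_context(); negative on error. */
static int demo_create_context(
        const struct cipher_des_ede3_context_definition *defn,
        struct demo_context *ctx)
{
    if (defn->key_ptr == NULL || defn->key_len != sizeof(ctx->key))
        return -1;
    memcpy(ctx->key, defn->key_ptr, defn->key_len);
    ctx->key_len = defn->key_len;
    return 0;
}

/* Zero memory through a volatile pointer so the stores are not elided. */
static void wipe(void *p, size_t n)
{
    volatile u8 *v = p;

    while (n--)
        *v++ = 0;
}

/* Build a context, then clear the caller's key and the definition. */
static int setup_and_wipe(u8 *key, u8 key_len, u8 *iv, u8 iv_len,
                          struct demo_context *ctx)
{
    struct cipher_des_ede3_context_definition defn;
    int rc;

    defn.operation = ENCRYPT_CBC;
    defn.key_ptr = key;
    defn.key_len = key_len;
    defn.iv_ptr = iv;
    defn.iv_len = iv_len;

    rc = demo_create_context(&defn, ctx);

    wipe(key, key_len);         /* the caller's copy of the key */
    wipe(&defn, sizeof(defn));  /* pointers and lengths */
    return rc;
}
```

The clearing is done whether or not context creation succeeded, so a
failed setup does not leave key material behind either.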

The context structure however is long-lived, at least as long as the triple
DES engine needs to be used for encrypting or decrypting. The context would
thus store the key schedule for the software implementation, or at least a
handle to a key schedule that is stored in a hardware device. In addition
the IV may be stored here if the chained IV technique is used to generate
the IV for the next packet on outbound IPSec operations.

2.3 Digest functions

The above fits well for encryption, decryption, compression and
decompression. Now, a note on split hashing functions and hardware
accelerators.

Currently hashing is done through a single call to (MD5|SHA1)Init, then
many calls to (MD5|SHA1)Update followed by a single call to
(MD5|SHA1)Final. The first call is made during the keying or re-keying of a
tunnel; the second call is used for each successive update of the hash
context with the data passed in; the final call is used to return the hash
value and reset the context.

During a normal packet-processing operation there will be approximately 5
Update calls using AH and 2 Update calls using ESP (both use 2 Final
calls). Most of these calls advance the hash by a small number of bytes and
then return... this would be horrible for hardware accelerators, where you
are penalized for a large number of calls (the gain of hardware
acceleration comes from doing a small number of large bulk operations, not
vice versa).

My question is: would it not be possible to consolidate all these calls
into one call that will do everything and just return the finished hash?

If it is possible for all the hashing for one ESP/AH packet to be done by
one operation then there need only be one function to process the whole
bulk of the packet. If the contrary is true then we need Update and Final
functions, where Update just queues up the data and Final flushes the
buffer and sends it to hardware.
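
The queue-then-flush idea can be sketched like this. It is only an
illustration under assumptions: buffered_digest, bd_init(), bd_update(),
bd_final(), the hw_hash hook and DIGEST_BUF_MAX are all hypothetical names,
and toy_hw_hash() is an XOR fold standing in for a device, not a real
digest. The sketch flushes only on Final; flushing earlier at some minimum
buffer size would need a device that can resume a partial hash.

```c
#include <stddef.h>
#include <string.h>

#define DIGEST_BUF_MAX 2048     /* hypothetical: room for one packet */

struct buffered_digest {
    unsigned char buf[DIGEST_BUF_MAX];
    size_t used;
    /* HYPOTHETICAL hook: one bulk hash over the whole buffer. */
    int (*hw_hash)(const unsigned char *data, size_t len,
                   unsigned char *out);
    unsigned int hw_calls;      /* dispatch counter, illustration only */
};

static void bd_init(struct buffered_digest *d,
                    int (*hw_hash)(const unsigned char *, size_t,
                                   unsigned char *))
{
    d->used = 0;
    d->hw_hash = hw_hash;
    d->hw_calls = 0;
}

/* Update: only queue the data; negative if it does not fit. */
static int bd_update(struct buffered_digest *d, const void *data, size_t len)
{
    if (len > sizeof(d->buf) - d->used)
        return -1;
    memcpy(d->buf + d->used, data, len);
    d->used += len;
    return 0;
}

/* Final: a single dispatch over everything queued, then reset. */
static int bd_final(struct buffered_digest *d, unsigned char *out)
{
    int rc = d->hw_hash(d->buf, d->used, out);

    d->hw_calls++;
    d->used = 0;
    return rc;
}

/* Toy stand-in for the device: XOR of all bytes, NOT a real digest. */
static int toy_hw_hash(const unsigned char *data, size_t len,
                       unsigned char *out)
{
    unsigned char acc = 0;
    size_t i;

    for (i = 0; i < len; i++)
        acc ^= data[i];
    out[0] = acc;
    return 0;
}
```
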
This of course would be hidden from the user - different implementations
would use different minimum buffer sizes before flushing the data to the
hardware.

2.4 Low memory for contexts in hardware device

As a final note: when creating a context in hardware there may not always
be sufficient context storage space on the device (some devices are limited
to storing, say, 1k sessions concurrently). The API to the chip would fail
the insertion of the key if there was no more room in the device. The hooks
are not allowed to fail in this way. Instead, the hook should cache the
current sessions, and if a new session creation fails in the chip it should
remove the least recently used one and try to insert the new one again.


3. PROTOCOL ENGINES

This part of the document will consider only two protocol engine groups:
one for ESP and another for AH. In the naming convention defined above
these are "proto-esp" and "proto-ah", respectively. However, other engine
groups will exist to support IPIP, IPComp, and possibly others.

3.1 Overview of a protocol engine

A protocol engine functions in a similar way to all other generic engines.
The user needs to acquire the appropriate engine pointer so that it can
create a context:

    struct generic_engine *engine;
    struct generic_context *context;
    struct proto_esp_context_definition defn;
    struct cipher_des_ede3_context_definition des_ede3_defn;
    struct digest_hmac_md5_context_definition hmac_md5_defn;

    engine = get_engine_by_name("proto-esp");

    defn.operation = ENCAPSULATE;
    defn.spi_num = 0x100;
    defn.sequence = 0;
    defn.ip_ver = PF_INET;
    defn.flags = ANTI_REPLAY_32BIT | SEQUENCE_OVERFLOW_NOTIFY;
    defn.tun_src.v4.s_addr = htonl(INADDR_LOOPBACK);
    defn.tun_dst.v4.s_addr = htonl(INADDR_LOOPBACK);
    defn.cipher_name = "cipher-des_ede3";
    defn.cipher_defn = &des_ede3_defn;
    defn.digest_name = "digest-hmac_md5";
    defn.digest_defn = &hmac_md5_defn;

    create_context( engine, &defn, GFP_ATOMIC, &context);

    put_engine( engine );

The above 'defn' structure is used to configure a new ESP context. As can
be observed, the context is configured to operate in the outbound direction
(operation = ENCAPSULATE) with the initial sequence number set to zero and
using an SPI of 0x100. The 'defn.flags' provide further options such as a
32-bit anti-replay window (as opposed to a 64-bit window) and a sequence
overflow notification which will pass a special error condition to the
callback when the sequence number overflows.

The 'defn' structure also takes the names and definitions of the engines
which are used to perform crypto operations. If any of these are not
supplied (i.e. the name and defn are left as NULL) then that cipher or
digest is not used in the ESP process. Similarly, if the tunnel end points
are not defined then the engine will process packets in transport mode.

As mentioned in section 2, each engine will have its own typedef which will
embody the definition structure. The AH protocol will, for example, lack
the cipher configuration parameters and other options.

Once the context is created the engine may be used for processing jobs.
In the case of the ESP protocol engine, the jobs are encapsulated skb
structures which need to be processed in the outbound or inbound direction.
The direction of processing is set in the context and cannot be changed. To
process packets in the inverse direction another context must be created
and configured accordingly.

The job is constructed in the same manner as in the case of a crypto
engine: the job structure is created and initialized to encapsulate the skb
pointer, then the job is executed:

    create_job( context, GFP_ATOMIC, &job );

    job->data.in_ptr = skb_in;
    job->data.out_ptr = skb_out;
    job->callback = callback_function;
    job->opaque = context;

    execute_job( job );

Note that in the example above skb_in and skb_out can point to the same
buffer. The execution of the job will complete all operations on the skb
including any crypto and digest operations that may need to be done. In
total the engine processing may detach many times during the execution of
this task; for example there may be 3 hardware assist devices: one to
compute the triple DES operation, one to compute the MD5 hash and another
to keep track of replay windows. It is also possible that this task is done
completely in software or on one hardware device.

Upon completion of the job the callback function is called to process the
result of the operation; it is important that the callback release the job
structure.

    if( job->result )
        handle_failure();
    else
        handle_success();

    release_job( &job );

Finally, when a context is no longer needed the module which created it in
the first place should call:

    release_context( &context );

3.2 Configuration specifics

The configuration of a protocol engine is similar to the configuration of
any other generic engine. Given an engine group of "proto-esp" the user of
the engine would include the proto-esp.h header file. From this header file
the structure seen below would be used to configure the new context of
proto-esp-sw or proto-esp-acme.

    struct proto_esp_context_definition {
        /* ESP specifics */
        enum { ENCAPSULATE, DECAPSULATE } operation;
        u32 spi_num;
        enum { SEQUENCE_OVERFLOW_NOTIFY = 1,
               ANTI_REPLAY_32BIT = 2,
               ANTI_REPLAY_64BIT = 4 } flags;

        /* tunnel specifics */
        sa_family_t ip_ver;         /* PF_INET or PF_INET6 */
        esp_ip_address tun_src;
        esp_ip_address tun_dst;

        /* crypto op specifics */
        char *cipher_name;
        void *cipher_defn;
        char *digest_name;
        void *digest_defn;
    };

The 'esp_ip_address' in the above structure allows IPv4 or IPv6 addresses
to be set as the end-points for the ESP tunnel (in tunnel mode); to select
transport mode both 'tun_src' and 'tun_dst' are left ZERO-ed.

3.3 Engine embedding

It is clear that the ESP engine will not duplicate the 3DES and MD5
algorithms; instead a context of the software ESP engine will contain
contexts of the cipher-des_ede3 and digest-hmac_md5 engines. Note that in
section 3.2 the 'proto_esp_context_definition' structure contains hooks for
cipher/digest names and definitions. If the ESP is to be encrypt-only then
the digest name/defn are set to NULL.

When a job is executed on the proto-esp context two distinct jobs will be
executed on the embedded contexts. Either or both may be executed in
hardware. If there is a requirement to execute one after another then the
engine which encapsulates the others must enforce that temporal order. The
simplest way to achieve this is to execute the second job from the
successful callback of the first job.


4. TUNNEL ENGINE

This engine will handle all of the PFKey interface as well as the interface
to netfilter, which is essential for the actual packet mangling. The
processing of skb's will be moved to the protocol engine as described
above.

Currently in the KLIPS code, the ipsec_rcv and ipsec_tunnel functions lock
the tdb entry while working on that SA (this may have changed recently).
This is OK for a single-processor implementation running only in software.
However, in some instances the tdb may be used by multiple packets
concurrently. So this lock will have to go away in a KLIPS that is hardware
accelerated (or even wants to use two CPUs for a single but very-very-busy
tunnel). I recommend using 'use count' and 'delete bit', as outlined in
Rusty's Unreliable Locking Guide, to prevent a tdb from being deleted
before its time is up.


5. PACKET PATH IN SOFTWARE

This section will try to describe what happens when a packet arrives. The
tunnel engine (with help from netfilter) locates the appropriate protocol
engine instance(s) that were configured for a specific connection. It then
calls the outbound or inbound function that is appropriate for the
orientation of the skb. The protocol engine calls the algorithm function(s)
(using pointers that were set during SA creation). On the completion of all
the transform functions the packet is returned to the tunnel processor,
which may decide that another tunnel is appropriate for this packet (as it
was described above) and the cycle continues.

Below is the above in a diagram (for the software case). NOTE that I am
using ESP with 3DES and MD5 as an example and there is no reason other
algorithms could not take their place.

     [tunnel processing]   [protocol processing]  [crypto processing]

PLUTO
|
| tdb_init(tdb, ...)
|------------>|
|             | create_context( esp_engine, definition,
|             |                 GFP_ATOMIC, &esp )
|             |-------------------->|
|             |                     | create_context(des3_engine,
|             |                     |                GFP_ATOMIC, &des3)
|             |                     |-------------------->|
|             |                     |esp->cipher_con <----|
|             |                     |
|             |                     | create_context(md5_engine,
|             |                     |                GFP_ATOMIC, &md5)
|             |                     |-------------------->|
|             |                     |esp->digest_con <----|
|             |tdb->con <-----------|
|rc <---------|

NIC
| ipsec_rcv(skb)
|
| tdb_lookup(spi,...)
|------------>|
|tdb <--------|
|
|create_job( tdb->con, GFP_ATOMIC, &job );
|job->skb = skb;
|job->tdb = tdb;
| execute_job( job )
| /* MACRO: job->con->ops.process( job ) */
|---------------------------------->|
|                                   | des3 = job->con->cipher_con;
|                                   |
|                                   | create_job(des3,GFP_ATOMIC,&cjob)
|                                   |
|                                   | execute_job( cjob )
|                                   |-------------------->|
|                                   |rc <-----------------|
|                                   |
|                                   | md5 = job->con->digest_con;
|                                   |
|                                   | create_job(md5,GFP_ATOMIC,&djob)
|                                   |
|                                   | execute_job( djob )
|                                   |-------------------->|
|                                   |rc <-----------------|
|rc <-------------------------------|
|
(loop if unmangled packet is an ESP/AH/IPCOMP)
|
| reinject(skb)

Anyway, you get the picture. Here are some finer points:

  * ipsec_rcv will loop for nested tunnels/SAs
  * ipsec_tunnel will be similar to ipsec_rcv
  * tdb->con points to a context of the ESP protocol engine
  * tdb->con->cipher_con and tdb->con->digest_con point to the cipher and
    digest contexts

The point is that each layer makes no guesses about how the next one works.
For example, we know that DES encrypt and DES decrypt are the same
operation with a different key schedule. However, we do not exploit this
and have two DES functions. The advantage of this is that if a new
algorithm is implemented for which such a trick cannot be used, the layer
before it does not have to change. That is to say, the protocol layer
should not depend on specific implementation hacks of the algorithm layer.


6. PACKET PATH IN HARDWARE

The major issue with hardware acceleration is this: since most of KLIPS is
running in bottom-half time (initiated by the ISR of the NIC driver) it
cannot sleep and wait for a device to complete its computations. Even if it
could, we would not want to do this since we have better things to do, like
servicing other user tasks, processing the next packet, searching for
aliens, etc.

When dispatching the job to a crypto processor the IP stack needs to be
informed that the packet was stolen.
This is quite important; if we just returned from ipsec_rcv then the skb
could be deleted while we were still working on it in parallel.

Here is my interpretation of what will happen if you add hardware to the
above diagram.

     [tunnel processing]   [protocol processing]  [crypto processing]

NIC
| ipsec_rcv(skb)
|
| tdb_lookup(spi,...)
|------------>|
|tdb <--------|
|
|create_job( tdb->con, GFP_ATOMIC, &job );
|job->skb = skb;
|job->tdb = tdb;
| execute_job( job )
| /* MACRO: job->con->ops.process( job ) */
|---------------------------------->|
|                                   | des3 = job->con->cipher_con;
|                                   |
|                                   | create_job(des3,GFP_ATOMIC,&cjob)
|                                   |
|                                   | execute_job( cjob )
|                                   |-------------------->|
|                                   |                     |----> dispatch H/W
|                                   |rc <-----------------|
|rc <-------------------------------|

[ millions of nanoseconds later in code near by... ]

             H/W interrupt
             |  esp_callback(job)   |
             |--------------------->| /* test for success */
             |                      |
             |                      | md5 = job->con->digest_con;
             |                      |
             |                      | create_job(md5,GFP_ATOMIC,&djob)
             |                      |
             |                      | execute_job( djob )
             |                      |-------------------->|
             |                      |                     |----> dispatch H/W
             |                      |rc <-----------------|
             |rc <------------------|

[ millions of nanoseconds later in code near by... ]

             H/W interrupt
             |  esp_callback(job)   |
             |--------------------->| /* test for success */
| ipsec_callback <------------------| /* issue callback into client code */
|
(may need to repeat ipsec_rcv if unmangled packet is an ESP/AH)
|
| netif_rx(job->skb)
| free_skb(job)

A bit more involved but pretty much the same idea. The difference is that
the continuity is broken twice: once to do encryption and once to do
authentication. Here is a list of notes for this diagram:

  * job is a buffer that stores the information about one transaction; it
    is used internally to store all local variables that need to be kept
    around between the dispatch of the operation and the matching
    interrupt. It is worth mentioning that there may be multiple jobs per
    tdb.
    This will happen most frequently on the receiver, if the sender is much
    faster and can swamp its counterpart. For this reason we keep a
    separate 'job' structure for each packet that comes into KLIPS.


7. HARDWARE ACCELERATION POLICY

As briefly mentioned above, you should be able to assign a crypto engine -
be it software or an instance of a hardware accelerator - to a connection
or even to an SA.

For example, say you have a massive security gateway (SG) with a fair
number of static connections and a large number of possible road-warriors
that may connect. You feel that the static connections are more important
so you allocate a hardware engine to them but not to the %default
connection. Thus each road-warrior would get the default engine (i.e.
software).

Say that in this box we have two accelerators named alpha and beta. Alpha
is capable of performing whole packet processing for ESP only. Beta is only
able to perform 3DES. A sample configuration could look like this:

    config setup
        hwload=alpha,beta
        ...

    conn %default
        doesp=software
        doah=software
        do3des=software
        domd5=software
        dosha1=software
        ...

    conn static-1
        doesp=alpha
        ...

    conn static-2
        do3des=beta
        ...

    conn roadwarrior
        ...

In the above example alpha and beta are names of drivers that reside in
/lib/modules/your.kernel.ver/kernel/net/ipsec alongside ipsec.o. Thus after
loading ipsec.o, but before setting up any tunnels, the setup script will
have to check the hwload variable and load the modules listed there. These
modules are most likely just wrappers around crypto drivers and will
require that the main driver be loaded as well - modutils takes care of
that for us.

We have discussed signing crypto engines for loading into freeswan and
decided against it, as loading them requires root privileges. If you have
root you can do it all anyway.

As always... comments and critique are most appreciated.

Regards,
Bart Trojanowski

=============================================================================
This document is a copyright of Bart Trojanowski. If available, an updated
version of this document can be found at:
    http://www.jukie.net/~bart/linux-ipsec/