| Enabling High Performance Bulk Data Transfers With SSH |
| Chris Rapier | |
| Benjamin Bennett | |
| Pittsburgh Supercomputing Center | |
| TIP '08 |
| Moving Data |
| Still crazy after all these years | ||||
| Multiple solutions exist | ||||
| Protocols | ||||
| UDT, SABUL, etc. | ||||
| Implementations | ||||
| GridFTP, kFTP, bbFTP, hand rolled and more... | ||||
| Not to mention | ||||
| Advanced congestion control, autotuning, jumbograms, etc. | ||||
| Many Solutions No Answers |
| All developed as a solution to the same problem | ||
| Moving lots of data very fast can be very difficult | ||
| Unfortunately, no single solution meets all needs. | ||
| Fast, easy to use, inexpensive to maintain, flexible, secure | ||
| What About SSH? |
| Easy to use. | |
| Cheap to maintain. | |
| Installed everywhere. | |
| Flexible. | |
| Strong cryptography. |
| Why not SSH? |
| It can be really really slow. |
| How slow? |
| A little better |
| What changed? |
| Why the improvement in OpenSSH 4.7? | |||
| SSH is a multiplexed application | |||
| Each channel requires its own flow control which is implemented as a receive window | |||
| In 4.7 the maximum window size was increased to ~1MiB up from 64KiB | |||
| Windows |
| Receive windows advertise the amount of data a system or application is willing to accept per round trip time. | ||
| Effective window size is the minimum of all windows, both protocol and application. | ||
| Each window must be tuned and in sync to maximize throughput. | ||
| If any one is out of tune the entire connection will suffer. | ||
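The window rules above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the usual approximation that at most one window of data is in flight per round trip; the 4 MiB TCP window and 50 ms RTT are hypothetical values chosen for illustration, and the two SSH window sizes are the 64KiB and ~1MiB figures quoted earlier.

```python
KIB = 1024
MIB = 1024 * KIB


def effective_window(*windows_bytes: int) -> int:
    """The connection is limited by the smallest window in the stack."""
    return min(windows_bytes)


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bits/s when one window is delivered per round trip."""
    return window_bytes * 8 / rtt_seconds


tcp_window = 4 * MIB  # hypothetical, well-tuned TCP receive window
rtt = 0.050           # hypothetical 50 ms round trip time

for label, ssh_window in (("64 KiB", 64 * KIB), ("1 MiB", 1 * MIB)):
    window = effective_window(tcp_window, ssh_window)
    mbps = max_throughput_bps(window, rtt) / 1e6
    print(f"SSH window {label}: at most {mbps:.1f} Mbps")
```

In this sketch the SSH channel window, not the TCP window, sets the ceiling in both cases, which is the "out of tune" situation the slide describes.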
| Windows in HPN-SSH |
| Dynamically defined receive window size grows to match the TCP window. | ||
| Set to TCP RWIN on start. | ||
| Grows with RWIN on autotuning systems. | ||
| Dynamic sizing reduces over-buffering problems. | ||
| SFTP is Special |
| SFTP adds *another* layer of flow control. | ||
| All SFTP packets are treated as requests | ||
| By default no more than 16 outstanding requests. | ||
| Results in a 512KiB window | ||
| Increase using -R on command line | ||
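The 512KiB figure follows from the request limit multiplied by the size of each request. The sketch below assumes 32 KiB per request, a value inferred from 512KiB / 16 rather than stated on the slide; the function names and the 1 Gbps, 50 ms target are hypothetical.

```python
KIB = 1024


def sftp_window_bytes(outstanding_requests: int,
                      request_bytes: int = 32 * KIB) -> int:
    """Data that can be in flight given a cap on outstanding requests."""
    return outstanding_requests * request_bytes


def requests_needed(target_window_bytes: int,
                    request_bytes: int = 32 * KIB) -> int:
    """Smallest request count whose window covers the target (ceiling division)."""
    return -(-target_window_bytes // request_bytes)


# Default of 16 outstanding requests, as on the slide.
print(sftp_window_bytes(16) // KIB, "KiB")

# Hypothetical path: 1 Gbps at 50 ms RTT needs about 6.25 MB in flight.
bdp_bytes = int(1e9 / 8 * 0.050)
print(requests_needed(bdp_bytes), "requests to cover the path")
```

The second result is the kind of value one would pass to -R, assuming the request size is left at the value assumed here.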
| A lot better |
| But... |
| As the throughput increases crypto demands more of the processor. | ||
| The transfer is now processor bound | ||
| We Need More Power? |
| Two solutions to processor bound transfers | |||
| Throw more processing power at the problem | |||
| Do the work more efficiently | |||
| Define 'work' | |||
| The None Switch |
| Many people only need secure authentication. The data can pass in the clear. | ||
| HPN-SSH allows users to switch to a 'None' cipher after authentication. | ||
| Done! |
| As far as we can go? |
| Windows are already optimized. | ||
| No more real improvements available there | ||
| NONE cipher is limited to a subset of transfers. | ||
| Sometimes you absolutely need full encryption. | ||
| So what now? | ||
| More Power |
| Common assumption that current hardware is incapable of meeting crypto demand | ||
| Is it true? | ||
| Today's Hardware |
| Laptop | ||
| Two 64bit general purpose cores | ||
| 1GiB to 4GiB RAM | ||
| 1Gbps ethernet | ||
| Desktop/Workstation | ||
| Two to eight 64bit general purpose cores | ||
| 1GiB to 8GiB RAM | ||
| 1Gbps ethernet | ||
| OpenSSL Benchmarks |
| hmac-md5 @ 1Gbps, ~0.3 cores | |
| aes256-cbc @ 1Gbps, ~1.34 cores | |
| Crypto total @ 1Gbps, ~1.64 cores | |
| We have 8! |
| MAC requires fraction of one core | |
| Cipher requires more than one core | |
| MAC, cipher, and more all within a single execution thread |
| Multi-threading on functional boundaries | |||
| Perform MAC and cipher on a packet concurrently | |||
| Possible on sender, not on receiver | |||
| Process multiple packets concurrently (pipeline) | |||
| Cipher still needs more than one core | |||
| Multi-threading within cipher | |||
| Can it be parallelized? | |||
| SSH Cipher Modes |
| CBC | ||
| Most common | ||
| RFC 4253 "The Secure Shell (SSH) Transport Layer Protocol" specifies only CBC mode ciphers, arcfour, and none. | ||
| CTR | ||
| Specified in RFC 4344 "SSH Transport Layer Encryption Modes" | ||
| More desirable security properties than CBC | ||
| Cipher Block Chaining Mode Encryption |
| Cipher Block Chaining Mode Decryption |
| Encrypt must be serial | |
| Decrypt may be parallel | |
| That doesn't help so much :-( |
| Counter Mode Encryption |
| Counter Mode Decryption |
| Encrypt may be parallel | |
| Decrypt may be parallel | |
| Keystream can be pregenerated | |
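The three counter mode properties can be seen in a toy example. This is a sketch only: SHA-256 over key, nonce and counter stands in for the AES block encryption of the counter, so it is not AES-CTR and must not be used for real encryption. The function names are hypothetical. Because each keystream block depends only on its counter value, blocks can be computed in any order or concurrently, and the same XOR performs both encryption and decryption.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 32  # bytes per keystream block in this toy (SHA-256 digest size)


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for AES_Encrypt(counter): no dependence on any other block.
    return hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes, workers: int = 4) -> bytes:
    """Encrypt or decrypt: XOR the data with independently generated keystream."""
    n_blocks = -(-len(data) // BLOCK)  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in counter order even if computed out of order.
        blocks = list(pool.map(lambda i: keystream_block(key, nonce, i),
                               range(n_blocks)))
    stream = b"".join(blocks)
    return bytes(d ^ s for d, s in zip(data, stream))


key, nonce = b"demo key", b"demo nonce"
message = b"bulk data " * 20
ciphertext = ctr_xor(key, nonce, message)
assert ctr_xor(key, nonce, ciphertext) == message
```

The thread pool here only illustrates that the blocks are independent; it is not meant to show a speedup in Python.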
| Let's get to work... |
| Uses an arbitrary number of cipher threads (and cores) to generate a single keystream. | |
| Cipher threads pre-generate keystream, starting once a cipher context key and IV are known. | |
| Leaves only keystream dequeue & XOR for encrypt/decrypt operations in main SSH thread. |
| Single Cipher Thread |
| Cipher Thread | ||
| AES_Encrypt(ctr) | ||
| Inc(ctr) | ||
| Main Thread | ||
| read(disk) | ||
| Packetize | ||
| Compute MAC | ||
| XOR | ||
| write(net) | ||
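The split between the two threads can be sketched as a producer and a consumer joined by a bounded queue. This is an illustrative sketch, not the HPN-SSH code: SHA-256 again stands in for AES_Encrypt, the queue depth is arbitrary, all names are hypothetical, and the main thread's disk read, MAC computation and network write are omitted so that only the keystream dequeue and XOR remain.

```python
import hashlib
import queue
import threading

BLOCK = 32  # bytes per keystream block in this toy


def keystream_block(key: bytes, counter: int) -> bytes:
    # Stand-in for AES_Encrypt(ctr).
    return hashlib.sha256(key + counter.to_bytes(8, "big")).digest()


def cipher_thread(key: bytes, out_q: queue.Queue, stop: threading.Event) -> None:
    """Producer: generate keystream ahead of demand until told to stop."""
    ctr = 0
    while not stop.is_set():
        block = keystream_block(key, ctr)  # AES_Encrypt(ctr)
        ctr += 1                           # Inc(ctr)
        while not stop.is_set():
            try:
                out_q.put(block, timeout=0.1)  # blocks while the queue is full
                break
            except queue.Full:
                continue


def start_keystream(key: bytes, depth: int = 64):
    out_q = queue.Queue(maxsize=depth)
    stop = threading.Event()
    threading.Thread(target=cipher_thread, args=(key, out_q, stop),
                     daemon=True).start()
    return out_q, stop


def xor_packet(packet: bytes, ks_q: queue.Queue) -> bytes:
    """Consumer (main thread): dequeue keystream and XOR, nothing more."""
    out = bytearray()
    for offset in range(0, len(packet), BLOCK):
        ks = ks_q.get()
        out.extend(p ^ k for p, k in zip(packet[offset:offset + BLOCK], ks))
    return bytes(out)
```

A sender and a receiver that each start a keystream with the same key, and process packets with the same boundaries, recover the original data by calling `xor_packet` on each side.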
| Multiple Cipher Threads |
| Ring of bounded queues | ||
| Each queue holds a portion of keystream | ||
| Each queue exclusively accessed | ||
| Queue counters are offset initially and at each fill | ||
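One way to read the counter offsets is sketched below. This is an interpretation under stated assumptions, not the actual HPN-SSH layout: with N queues of K keystream blocks each, queue q starts at counter q*K and advances by N*K on every refill, so a consumer walking the ring in order sees one continuous keystream. The function names and the sizes used are hypothetical.

```python
from itertools import islice


def queue_counters(queue_index: int, fill_number: int,
                   n_queues: int, blocks_per_queue: int) -> range:
    """Counter values a queue holds on its fill_number-th fill (0-based)."""
    start = (fill_number * n_queues + queue_index) * blocks_per_queue
    return range(start, start + blocks_per_queue)


def consumption_order(n_queues: int, blocks_per_queue: int):
    """Counters in the order the main thread sees them when draining the ring."""
    fill = 0
    while True:
        for q in range(n_queues):
            yield from queue_counters(q, fill, n_queues, blocks_per_queue)
        fill += 1


# Hypothetical ring: 4 queues of 3 blocks each.
print(list(queue_counters(1, 0, 4, 3)))  # first fill of queue 1
print(list(queue_counters(1, 1, 4, 3)))  # its next fill, offset by 4 * 3
print(list(islice(consumption_order(4, 3), 12)))
```

Under this scheme a cipher thread can refill a drained queue without coordinating counter values with the other threads, which fits the slide's point that each queue is accessed exclusively.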
| M-T AES-CTR Results |
| Conclusion |
| SSH designed for security | ||
| HPN-SSH is a set of performance enhancements to the most common SSH implementation, OpenSSH | ||
| High throughput with high latency | ||
| Kernel auto-tuning adjusts TCP flow control | ||
| HPN-SSH RecvBufferPolling adjusts SSH flow control | ||
| High throughput with any latency | ||
| HPN-SSH None cipher for non-private data | ||
| HPN-SSH Multi-threaded AES-CTR cipher | ||
| Future Work |
| Approaching 10Gbps | ||
| Continued multi-threading | ||
| Concurrent packet processing/pipelining | ||
| Efficiency | ||
| Striped data transfers | ||
| Exotic architectures | ||
| Where to get it |
| http://www.psc.edu/networking/projects/hpn-ssh | |
| Email: hpnssh@psc.edu |