Splice, Tee & VMsplice: Zero Copy in Linux
Herzelinux
http://tuxology.net
© Copyright 2004-2006, Michael Opdenacker
© Copyright 2003-2006, Oron Peled
© Copyright 2004-2006 Codefidence Ltd.
For full copyright information see last page.
Creative Commons Attribution-ShareAlike 2.0 license
Rights to copy
Attribution – ShareAlike 2.0
You are free
to copy, distribute, display, and perform the work
to make derivative works
to make commercial use of the work
Under the following conditions
Attribution. You must give the original author credit.
Share Alike. If you alter, transform, or build upon this work,
you may distribute the resulting work only under a license
identical to this one.
For any reuse or distribution, you must make clear to others the
license terms of this work.
Any of these conditions can be waived if you get permission from
the copyright holder.
Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode
This kit contains work by the following authors:
© Copyright 2004-2006 Michael Opdenacker, michael@free-electrons.com, http://www.free-electrons.com
© Copyright 2003-2006 Oron Peled, [email protected], http://www.actcom.co.il/~oron
© Copyright 2004 – 2008 Codefidence ltd., [email protected], http://www.codefidence.com
Kernel architecture
[Figure: layered stack. User space holds App1, App2, ... on top of the C library; below it sits the system call interface; the hardware is at the bottom.]
All modern CPUs support a dual mode of operation:
User mode, for regular tasks.
Supervisor (or privileged) mode, for the kernel.
The mode the CPU is in determines which instructions the CPU is
willing to execute:
“Sensitive” instructions will not be executed when the CPU is in
user mode.
The CPU mode is determined by one of the CPU registers, which stores
the current “Ring Level” (x86 terminology):
0 for supervisor mode, 3 for user mode; levels 1-2 are unused by Linux.
The System Call Interface
When a user-space task needs to use a kernel service, it makes a
“System Call”.
The C library places the system call number and its parameters in
registers and then issues a special trap instruction.
The trap atomically changes the ring level to supervisor mode and then
sets the instruction pointer to a kernel entry point.
The kernel finds the required system call via the system call table
and executes it.
Returning from the system call does not require a special instruction,
since in supervisor mode the ring level can be changed directly.
Linux System Call Path
[Figure: system call path. Surviving labels: Task, Glibc, entry.S, Task.]
Exchanging Data With User-Space (1)
In kernel code, you can't just memcpy between
an address supplied by user-space and
the address of a buffer in kernel-space!
They correspond to completely different
address spaces (thanks to virtual memory).
The user-space address may be swapped out to disk.
The user-space address may be invalid
(user-space process trying to access unauthorized data).
Exchanging Data With User-Space (2)
You must use dedicated functions such as the following ones in your
read and write file operations code:
#include <asm/uaccess.h>   (<linux/uaccess.h> in current kernels)
unsigned long copy_to_user(void __user *to,
                           const void *from,
                           unsigned long n);
unsigned long copy_from_user(void *to,
                             const void __user *from,
                             unsigned long n);
Make sure that these functions return 0!
Any other return value is the number of bytes that could not be
copied, which means that they failed.
DMA Offload Engine
A DMA (Direct Memory Access) offload engine is a piece of
hardware that performs memcpy without involving the CPU.
Example: Intel I/OAT (I/O Acceleration Technology).
It makes the copy the job of an entity other than the CPU.
It's zero copy, if by copy you mean copy by the CPU.
Simple Client/Server Copies
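[Figure: a client and a server, each shown as user space above its kernel. The client receives on the Rx path with ret = recv(s, buf); the server transmits on the Tx path with ret = send(s, buf).]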
Zero Copy
In-kernel buffer that the user has control over.
The buffer is implemented as a set of reference-counted pointers which
the kernel copies around without actually copying the data.
splice() moves data between the buffer and an arbitrary file descriptor,
in either direction.
tee() moves data from one buffer to another.
vmsplice() does the same as splice(), but instead of splicing from fd to
fd as splice() does, it splices from a user address range into a pipe.
Can be used anywhere a process needs to send something from
one end to another, but doesn't need to touch or even look at the data,
just forward it.
Zero Copy
In-kernel buffer that the user has control over.
Implemented as a pipe.
The pipe buffer is implemented as a set of reference-counted
pointers which the kernel copies around without actually
copying the data.
tee(), splice() and vmsplice() move data from the user program to
the pipe and from one pipe to the next, without copying.
Use when a process needs to send something from one end to
another, but doesn't need to touch or even look at the data.
Splice
long splice(int fd_in, off_t *off_in, int fd_out, off_t *off_out,
            size_t len, unsigned int flags);
splice() moves data between a pipe and an arbitrary file descriptor,
in either direction. At least one of the two descriptors must be a pipe.
sendfile() is now internally implemented as splice().
Use the SPLICE_F_MOVE flag to request zero copy. Pages are moved only
when possible: the buffers must be whole pages with no other references.
Other flags: SPLICE_F_NONBLOCK, and SPLICE_F_MORE,
which works like TCP_CORK.
Tee
long tee(int fd_in, int fd_out, size_t len, unsigned int flags);
tee() moves (read: copies references to) data from one
pipe buffer to the other. Both descriptors must be pipes.
The source pipe still holds the data.
The only useful flag is SPLICE_F_NONBLOCK.
Zero Copy of Example 1
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    do {
        /*
         * tee stdin to stdout.
         */
        len = tee(STDIN_FILENO, STDOUT_FILENO,
                  INT_MAX, SPLICE_F_NONBLOCK);
        if (len < 0) {
            if (errno == EAGAIN)
                continue;
            perror("tee");
            exit(EXIT_FAILURE);
        }
        ...

    close(fd);
    exit(EXIT_SUCCESS);
}
Vmsplice
long vmsplice(int fd, const struct iovec *iov,
              unsigned long nr_segs, unsigned int flags);
struct iovec {
    void *iov_base; /* Starting address */
    size_t iov_len; /* Number of bytes */
};
vmsplice() does the same as splice(), but instead of
splicing from fd to fd as splice() does, it splices from a user
address range into a pipe (fd must refer to a pipe).
Zero Copy Vmsplice
Zero copy requires the SPLICE_F_GIFT flag.
The user pages are a gift to the kernel. The application
must never modify this memory again, or page cache and
on-disk data may differ.
Gifting pages to the kernel means that a subsequent
splice() with SPLICE_F_MOVE can successfully move the
pages; if this flag is not specified, then a subsequent
splice() with SPLICE_F_MOVE must copy the pages.
Data must also be properly page-aligned, both in memory
address and in length.
Zero Copy of Example 2
Data
Zero Copy I: User-Mode Perspective
http://www.linuxjournal.com/article/6345
Copyrights and Trademarks
© Copyright 2004-2006, Michael Opdenacker
© Copyright 2004-2008 Codefidence Ltd.
Tux Image Copyright: © 1996 Larry Ewing
Linux is a registered trademark of Linus Torvalds.
All other trademarks are property of their respective owners.
Used and distributed under a Creative Commons Attribution-ShareAlike 2.0 license