Found this online; it is definitely authoritative material :)
- From: Linus Torvalds
- Newsgroups: fa.linux.kernel
- Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
- Date: Mon, 01 Aug 2005 20:12:32 UTC
- On Mon, 1 Aug 2005, Hugh Dickins wrote:
- >
- > > Aside, that brings up an interesting question - why should readonly
- > > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
- > > write access while readonly mappings of readonly files not? Or am I
- > > horribly confused?
- >
- > Either you or I. You'll have to spell that out to me in more detail,
- > I don't see it that way.
- We have always just done a COW if it's read-only - even if it's shared.
- The point being that if a process mapped did a read-only mapping, and a
- tracer wants to modify memory, the tracer is always allowed to do so, but
- it's _not_ going to write anything back to the filesystem. Writing
- something back to an executable just because the user happened to mmap it
- with MAP_SHARED (but read-only) _and_ the user had the right to write to
- that fd is _not_ ok.
- So VM_MAYWRITE is totally immaterial. We _will_not_write_ (and must not do
- so) to the backing store through ptrace unless it was literally a writable
- mapping (in which case VM_WRITE will be set, and the page table should be
- marked writable in the first case).
- So we have two choices:
- - not allow the write at all in ptrace (which I think we did at some
- point)
- This ends up being really inconvenient, and people seem to really
- expect to be able to write to readonly areas in debuggers. And doing
- "MAP_SHARED, PROT_READ" seems to be a common thing (Linux has supported
- that pretty much since day #1 when mmap was supported - long before
- writable shared mappings were supported, Linux accepted MAP_SHARED +
- PROT_READ not just because we could, but because Unix apps do use it).
- or
- - turn a shared read-only page into a private page on ptrace write
- This is what we've been doing. It's strange, and it _does_ change
- semantics (it's not shared any more, so the debugger writing to it
- means that now you don't see changes to that file by others), so it's
- clearly not "correct" either, but it's certainly a million times better
- than writing out breakpoints to shared files..
- At some point (for the longest time), when a debugger was used to modify a
- read-only page, we also made it writable to the user, which was much
- easier from a VM standpoint. Now we have this "maybe_mkwrite()" thing,
- which is part of the reason for this particular problem.
- Using the dirty flag for a "page is _really_ writable" is admittedly kind
- of hacky, but it does have the advantage of working even when the -real-
- write bit isn't set due to "maybe_mkwrite()". If it forces the s390 people
- to add some more hacks for their strange VM, so be it..
- [ Btw, on a totally unrelated note: anybody who is a git user and looks
- for when this maybe_mkwrite() thing happened, just doing
-     git-whatchanged -p -Smaybe_mkwrite mm/memory.c
- in the bkcvs conversion pinpoints it immediately. Very useful git trick
- in case you ever have that kind of question. ]
- I added Martin Schwidefsky to the Cc: explicitly, so that he can ping
- whoever in the s390 team needs to figure out what the right thing is for
- s390 and the dirty bit semantic change. Thanks for pointing it out.
- Linus
-
From: Linus Torvalds
- Newsgroups: fa.linux.kernel
- Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
- Date: Mon, 01 Aug 2005 22:00:09 UTC
- On Mon, 1 Aug 2005, Hugh Dickins wrote:
- > >
- > > We have always just done a COW if it's read-only - even if it's shared.
- > >
- > > The point being that if a process mapped did a read-only mapping, and a
- > > tracer wants to modify memory, the tracer is always allowed to do so, but
- > > it's _not_ going to write anything back to the filesystem. Writing
- > > something back to an executable just because the user happened to mmap it
- > > with MAP_SHARED (but read-only) _and_ the user had the right to write to
- > > that fd is _not_ ok.
- >
- > I'll need to think that through, but not right now. It's a surprise
- > to me, and it's likely to surprise the current kernel too.
- Well, even if you did the write-back if VM_MAYWRITE is set, you'd still
- have the case of having MAP_SHARED, PROT_READ _without_ VM_MAYWRITE being
- set, and I'd expect that to actually be the common one (since you'd
- normally use O_RDONLY to open a fd that you only want to map for reading).
- And as mentioned, MAP_SHARED+PROT_READ does actually happen in real life.
- Just do a google search on "MAP_SHARED PROT_READ -PROT_WRITE" and you'll
- get tons of hits. For good reason too - because MAP_PRIVATE isn't actually
- coherent on several old UNIXes.
- So you'd still have to convert such a case to a COW mapping, so it's not
- like you can avoid it.
- Of course, if VM_MAYWRITE is not set, you could just convert it silently
- to a MAP_PRIVATE at the VM level (that's literally what we used to do,
- back when we didn't support writable shared mappings at all, all those
- years ago), so at least now the COW behaviour would match the vma_flags.
- > I'd prefer to say that if the executable was mapped shared from a writable fd,
- > then the tracer will write back to it; but you're clearly against that.
- Absolutely. I can just see somebody mapping an executable MAP_SHARED and
- PROT_READ, and something as simple as doing a breakpoint while debugging
- causing system-wide trouble.
- I really don't think that's acceptable.
- And I'm not making it up - add PROT_EXEC to the google search around, and
- watch it being done exactly that way. Several of the hits mention shared
- libraries too.
- I strongly suspect that almost all cases will be opened with O_RDONLY, but
- still..
- Linus
-
From: Linus Torvalds
- Newsgroups: fa.linux.kernel
- Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
- Date: Mon, 01 Aug 2005 22:10:20 UTC
- On Mon, 1 Aug 2005, Linus Torvalds wrote:
- >
- > Of course, if VM_MAYWRITE is not set, you could just convert it silently
- > to a MAP_PRIVATE at the VM level (that's literally what we used to do,
- > back when we didn't support writable shared mappings at all, all those
- > years ago), so at least now the COW behaviour would match the vma_flags.
- Heh. I just checked. We still do exactly that:
-     if (!(file->f_mode & FMODE_WRITE))
-         vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
- some code never dies ;)
- However, we still set the VM_MAYSHARE bit, and that's the one that
- mm/rmap.c checks for some reason. I don't see quite why - VM_MAYSHARE
- doesn't actually ever do anything else than make sure that we try to
- allocate a mremap() mapping in a cache-coherent space, I think (ie it's a
- total no-op on any sane architecture, and as far as rmap is concerned on
- all of them).
- Linus
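Reading between the lines, here is what I took away, or at least came to understand:
The mapping implementation distinguishes the read/write permissions of the backing store from the read/write permissions of the mapping itself.
The former are decided by open() before mmap(): O_RDONLY and friends are converted and stored in file->f_mode as FMODE_XXX. The latter are the PROT_XXX values passed to mmap(). A third category is the mmap() flags, MAP_XXX; the permission-related ones are mainly MAP_SHARED and MAP_PRIVATE.
First, the VM_MAYXXX family: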
- readonly mappings of writeable files (with VM_MAYWRITE set)
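So a "read-only mapping of a writable file" is PROT_READ (or VM_READ) | VM_MAYWRITE.
A "read-only mapping of a read-only file" should then be PROT_READ (or VM_READ) | VM_MAYREAD, with VM_MAYWRITE not set.
How should the combination MAP_SHARED | PROT_READ | VM_MAYWRITE be handled? That is the subject of the thread above. I can't speak to other aspects, but the thread says ptrace is allowed to write, because gdb needs to write to read-only regions, for example to set a breakpoint in the text segment. The write must not be synced to disk, though, because a PROT_READ mapping has no write permission. The implementation ignores MAP_SHARED for this combination and turns the page into a COW page. That in turn contradicts MAP_SHARED: so it's clearly not "correct" either, but it's certainly a million times better than writing out breakpoints to shared files.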
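The VM_MAYWRITE bit can be observed from user space, because mprotect() refuses to grant a permission whose VM_MAYxxx bit is missing. The following is a minimal sketch, assuming a Linux system with a writable /tmp; the function name and the temp file name are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Maps one page of a temp file PROT_READ, using the given open() and
 * mmap() flags, then asks mprotect() for PROT_WRITE as well.
 * Returns 0 if mprotect() succeeded, its errno if it failed,
 * or -1 if the setup itself failed.
 */
int mprotect_write_result(int open_flags, int map_flags)
{
    char path[] = "/tmp/vmflags-XXXXXX";   /* hypothetical temp file name */
    long page = sysconf(_SC_PAGESIZE);
    int tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    if (ftruncate(tmp, page) != 0) {
        close(tmp);
        unlink(path);
        return -1;
    }
    close(tmp);

    /* Reopen with the access mode under test; this is what sets f_mode. */
    int fd = open(path, open_flags);
    unlink(path);
    if (fd < 0)
        return -1;

    void *p = mmap(NULL, page, PROT_READ, map_flags, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    int result = (mprotect(p, page, PROT_READ | PROT_WRITE) == 0) ? 0 : errno;
    munmap(p, page);
    return result;
}
```

On Linux I would expect EACCES for O_RDONLY with MAP_SHARED (VM_MAYWRITE was cleared), and success for O_RDWR with MAP_SHARED and for O_RDONLY with MAP_PRIVATE (VM_MAYWRITE is still set in both).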
The relevant code:
	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
	if (!vma)
		return -ENOMEM;

	vma->vm_mm = mm;
	vma->vm_start = addr;
	vma->vm_end = addr + len;
	vma->vm_flags = vm_flags(prot,flags) | mm->def_flags;

	if (file) {
		VM_ClearReadHint(vma);
		vma->vm_raend = 0;

		if (file->f_mode & FMODE_READ)
			vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
		if (flags & MAP_SHARED) {
			vma->vm_flags |= VM_SHARED | VM_MAYSHARE;

			/* This looks strange, but when we don't have the file open
			 * for writing, we can demote the shared mapping to a simpler
			 * private mapping. That also takes care of a security hole
			 * with ptrace() writing to a shared mapping without write
			 * permissions.
			 *
			 * We leave the VM_MAYSHARE bit on, just to get correct output
			 * from /proc/xxx/maps..
			 */
			if (!(file->f_mode & FMODE_WRITE))
				vma->vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
		}
	} else {
		vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
		if (flags & MAP_SHARED)
			vma->vm_flags |= VM_SHARED | VM_MAYSHARE;
	}
	vma->vm_page_prot = protection_map[vma->vm_flags & 0x0f];
	vma->vm_ops = NULL;
	vma->vm_pgoff = pgoff;
	vma->vm_file = NULL;
	vma->vm_private_data = NULL;
/proc/pid/maps prints something like rwxp (or s in the last position): readable, writable, executable, and private or shared.
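To make the flag arithmetic easier to follow, here is a small user-space model of the file-backed branch. It is only a sketch: the VM_* values are copied from the kernel's include/linux/mm.h, while MODEL_FMODE_READ, MODEL_FMODE_WRITE and both function names are hypothetical stand-ins. It assumes the PROT_* bits have already been translated to VM_READ/VM_WRITE/VM_EXEC, and it leaves out the earlier check that rejects a writable shared mapping of a file not opened for writing.

```c
/* Flag values as defined in the kernel's include/linux/mm.h. */
#define VM_READ     0x01
#define VM_WRITE    0x02
#define VM_EXEC     0x04
#define VM_SHARED   0x08
#define VM_MAYREAD  0x10
#define VM_MAYWRITE 0x20
#define VM_MAYEXEC  0x40
#define VM_MAYSHARE 0x80

/* Hypothetical stand-ins for FMODE_READ / FMODE_WRITE. */
#define MODEL_FMODE_READ  1
#define MODEL_FMODE_WRITE 2

/* Models the "if (file)" branch: computes vm_flags for a file mapping. */
unsigned model_file_vm_flags(unsigned f_mode, unsigned prot_bits, int map_shared)
{
    unsigned vm = prot_bits;

    if (f_mode & MODEL_FMODE_READ)
        vm |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    if (map_shared) {
        vm |= VM_SHARED | VM_MAYSHARE;
        /* File not open for writing: demote to a private mapping. */
        if (!(f_mode & MODEL_FMODE_WRITE))
            vm &= ~(VM_MAYWRITE | VM_SHARED);
    }
    return vm;
}

/* Formats flags the way /proc/pid/maps does; 's' depends on VM_MAYSHARE. */
char *model_maps_perms(unsigned vm, char out[5])
{
    out[0] = (vm & VM_READ)     ? 'r' : '-';
    out[1] = (vm & VM_WRITE)    ? 'w' : '-';
    out[2] = (vm & VM_EXEC)     ? 'x' : '-';
    out[3] = (vm & VM_MAYSHARE) ? 's' : 'p';
    out[4] = '\0';
    return out;
}
```

In this model a MAP_SHARED, PROT_READ mapping of a file opened O_RDONLY loses both VM_SHARED and VM_MAYWRITE, yet still formats as "r--s" because VM_MAYSHARE stays on, which is what the kernel comment means by getting correct output from /proc/xxx/maps.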
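One more piece of understanding: why does COW prevent the write from reaching the disk? The mechanism is as follows.
During COW, the newly allocated page is not added to the address_space of vma->vm_file, so page->mapping is NULL. When the page is swapped out, it is therefore written to the swap area as an anonymous page, not to the file.
The code: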
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
	unsigned long address, pte_t *page_table, pte_t pte)
{
	struct page *old_page, *new_page;

	old_page = pte_page(pte);
	if (!VALID_PAGE(old_page))
		goto bad_wp_page;

	/*
	 * We can avoid the copy if:
	 * - we're the only user (count == 1)
	 * - the only other user is the swap cache,
	 *   and the only swap cache user is itself,
	 *   in which case we can just continue to
	 *   use the same swap cache (it will be
	 *   marked dirty).
	 */
	switch (page_count(old_page)) {
	case 2:
		/*
		 * Lock the page so that no one can look it up from
		 * the swap cache, grab a reference and start using it.
		 * Can not do lock_page, holding page_table_lock.
		 */
		if (!PageSwapCache(old_page) || TryLockPage(old_page))
			break;
		if (is_page_shared(old_page)) {
			UnlockPage(old_page);
			break;
		}
		UnlockPage(old_page);
		/* FallThrough */
	case 1:
		flush_cache_page(vma, address);
		establish_pte(vma, address, page_table, pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
		spin_unlock(&mm->page_table_lock);
		return 1;	/* Minor fault */
	}

	/*
	 * Ok, we need to copy. Oh, well..
	 */
	spin_unlock(&mm->page_table_lock);
	new_page = page_cache_alloc();
	if (!new_page)
		return -1;
	spin_lock(&mm->page_table_lock);

	/*
	 * Re-check the pte - we dropped the lock
	 */
	if (pte_same(*page_table, pte)) {
		if (PageReserved(old_page))
			++mm->rss;
		break_cow(vma, old_page, new_page, address, page_table);

		/* Free the old page.. */
		new_page = old_page;
	}
	spin_unlock(&mm->page_table_lock);
	page_cache_release(new_page);
	return 1;	/* Minor fault */

bad_wp_page:
	spin_unlock(&mm->page_table_lock);
	printk("do_wp_page: bogus page at address %08lx (page 0x%lx)\n",address,(unsigned long)old_page);
	return -1;
}
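The user-visible effect of this copy can be checked without reading kernel code. The sketch below assumes a Linux system with a writable /tmp; the function name and the temp file name are made up for illustration. It only shows the outcome of copy-on-write on a private file mapping and does not claim to exercise the exact do_wp_page() path above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes 'B' through a mapping of a temp file that initially holds 'A',
 * then reads the file's first byte back with pread().
 * map_flags is MAP_PRIVATE or MAP_SHARED.
 * Returns the byte found in the file, or -1 on error.
 */
int file_byte_after_mapped_write(int map_flags)
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical temp file name */
    int fd = mkstemp(path);                 /* opened read-write */
    if (fd < 0)
        return -1;
    unlink(path);                           /* file lives until fd is closed */

    int result = -1;
    char byte = 'A';
    if (pwrite(fd, &byte, 1, 0) == 1) {
        char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flags, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'B';                     /* private mapping: lands in a copy */
            msync(p, 1, MS_SYNC);           /* flushes only shared mappings */
            if (pread(fd, &byte, 1, 0) == 1)
                result = byte;
            munmap(p, 1);
        }
    }
    close(fd);
    return result;
}
```

With MAP_PRIVATE the file still contains 'A', because the write went to a copied page that is not part of the file's page cache. With MAP_SHARED the file contains 'B'.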