| This revision is to address two problems found by Horst H. von Brand while |
| testing the v2 patches in Fedora: |
| https://bugzilla.redhat.com/show_bug.cgi?id=637647 |
| On his machine, we don't use _CRS by default, and the BIOS left some bridge |
| windows disabled. |
| |
| Problem 1: When we assigned space for the windows, we started at the top |
| and allocated [mem 0xffffffffffe00000-0xffffffffffffffff], which is |
| obviously useless because the CPU doesn't support physical addresses that |
| large. |
| |
| Problem 2: Subsequent allocations failed because I made an error in |
| find_resource(). We look for available space from [child->end + 1 to |
| root->end], and if the last child ends exactly at 0xffffffffffffffff, we |
| wrap around and start from zero. |
| |
| I made the top-down allocation conditional: an arch can select it at |
| boot-time, and there's a kernel command line option to change it for |
| debugging. |
| |
| |
| When we move PCI devices, we currently allocate space bottom-up, i.e., we look |
| at PCI bus resources in the order we found them, we look at gaps between child |
| resources bottom-up, and we align the new space at the bottom of an available |
| region. |
| |
| On x86, we move PCI devices more than we used to because we now pay attention |
| to the PCI host bridge windows from ACPI. For example, when we find a device |
| that's outside all the known host bridge windows, we try to move it into a |
| window, and we look for space starting at the bottom. |
| |
| Windows does similar device moves, but it looks for space top-down rather than |
| bottom-up. Since most machines are better-tested with Windows than Linux, this |
| difference means that Linux is more likely to trip over BIOS bugs in the PCI |
| host bridge window descriptions than Windows is. |
| |
| We've had several reports of Dell machines where the BIOS leaves the AHCI |
| controller outside the host bridge windows (BIOS bug #1), *and* the lowest |
| host bridge window includes an area that doesn't actually reach PCI (BIOS |
| bug #2). The result is that Windows (which moves AHCI to the top of a window) |
| works fine, while Linux (which moves AHCI to the bottom, buggy, area) doesn't |
| work. |
| |
| These patches change Linux to allocate space more like Windows does: |
| |
| 1) The x86 pcibios_align_resource() will choose space from the |
| end of an available area, not the beginning. |
| |
| 2) In the generic allocate_resource() path, we'll look for space |
| between existing children from the top, not from the bottom. |
| |
| 3) When pci_bus_alloc_resource() looks for available space, it |
| will start from the highest window, not the first one we found. |
| |
| This series fixes a 2.6.34 regression that prevents many Dell Precision |
| workstations from booting: |
| |
| https://bugzilla.kernel.org/show_bug.cgi?id=16228 |
| |
| Changes from v3 to v4: |
| - Use round_down() rather than adding ALIGN_DOWN(). |
| - Replace ARCH_HAS_TOP_DOWN_ALLOC #define with a boot-time architecture |
| choice and add a "resource_alloc_from_bottom" command line option to |
| revert to the old behavior (NOTE: this only affects allocate_resource(), |
| not pcibios_align_resource() or pci_bus_alloc_resource()). |
| - Fixed find_resource_from_top() again; it still didn't handle a child |
| that ended at the parent's end correctly. |
| |
| Changes from v2 to v3: |
| - Updated iomem_resource.end to reflect the end of usable physical address |
| space. Otherwise, we might allocate right up to 0xffffffff_ffffffff, |
| which isn't usable. |
| - Make allocate_resource() change conditional on ARCH_HAS_TOP_DOWN_ALLOC. |
| Without arch-specific changes like the above, it's too dangerous to |
| make this change for everybody at once. |
| - Fix 64-bit wraparound in find_resource(). If the last child happened |
| to end at ~0, we computed the highest available space as [child->end + 1, |
| root->end], which makes us think the available space started at 0, |
| which makes us return space that may already be allocated. |
| |
| Changes from v1 to v2: |
| - Moved check for allocating before the available area from |
| pcibios_align_resource() to find_resource(). Better to do it |
| after the alignment callback is done, and make it generic. |
| - Fixed pcibios_align_resource() alignment. If we start from the |
| end of the available area, we must align *downward*, not upward. |
| - Fixed pcibios_align_resource() ISA alias avoidance. Again, since |
| the starting point is the end of the area, we must align downward |
| when we avoid aliased areas. |
| |
| |
| Bjorn Helgaas (6): |
| resources: ensure alignment callback doesn't allocate below available start |
| resources: support allocating space within a region from the top down |
| PCI: allocate bus resources from the top down |
| x86/PCI: allocate space from the end of a region, not the beginning |
| x86: update iomem_resource end based on CPU physical address capabilities |
| x86: allocate space within a region top-down |
| |
| |
| Documentation/kernel-parameters.txt | 5 ++ |
| arch/x86/kernel/setup.c | 2 + |
| arch/x86/pci/i386.c | 17 ++++-- |
| drivers/pci/bus.c | 53 +++++++++++++++++-- |
| include/linux/ioport.h | 1 |
| kernel/resource.c | 99 ++++++++++++++++++++++++++++++++++- |
| 6 files changed, 163 insertions(+), 14 deletions(-) |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |
| The alignment callback returns a proposed location, which may have been |
| adjusted to avoid ISA aliases or for other architecture-specific reasons. |
| We already had a check ("tmp.start < tmp.end") to make sure the callback |
| doesn't return a location above the available area. |
| |
| This patch adds a check to make sure the callback doesn't return something |
| *below* the available area, as may happen if the callback tries to allocate |
| top-down. |
| |
| Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> |
| |
| |
| kernel/resource.c | 10 ++++++++-- |
| 1 files changed, 8 insertions(+), 2 deletions(-) |
| |
| |
| diff --git a/kernel/resource.c b/kernel/resource.c |
| index 7b36976..ace2269 100644 |
| |
| |
| @@ -371,6 +371,7 @@ static int find_resource(struct resource *root, struct resource *new, |
| { |
| struct resource *this = root->child; |
| struct resource tmp = *new; |
| + resource_size_t start; |
| |
| tmp.start = root->start; |
| /* |
| @@ -391,8 +392,13 @@ static int find_resource(struct resource *root, struct resource *new, |
| if (tmp.end > max) |
| tmp.end = max; |
| tmp.start = ALIGN(tmp.start, align); |
| - if (alignf) |
| - tmp.start = alignf(alignf_data, &tmp, size, align); |
| + if (alignf) { |
| + start = alignf(alignf_data, &tmp, size, align); |
| + if (tmp.start <= start && start <= tmp.end) |
| + tmp.start = start; |
| + else |
| + tmp.start = tmp.end; |
| + } |
| if (tmp.start < tmp.end && tmp.end - tmp.start >= size - 1) { |
| new->start = tmp.start; |
| new->end = tmp.start + size - 1; |
| |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |
| Allocate space from the top of a region first, then work downward, |
| if an architecture desires this. |
| |
| When we allocate space from a resource, we look for gaps between children |
| of the resource. Previously, we always looked at gaps from the bottom up. |
| For example, given this: |
| |
| [mem 0xbff00000-0xf7ffffff] PCI Bus 0000:00 |
| [mem 0xbff00000-0xbfffffff] gap -- available |
| [mem 0xc0000000-0xdfffffff] PCI Bus 0000:02 |
| [mem 0xe0000000-0xf7ffffff] gap -- available |
| |
| we attempted to allocate from the [mem 0xbff00000-0xbfffffff] gap first, |
| then the [mem 0xe0000000-0xf7ffffff] gap. |
| |
| With this patch an architecture can choose to allocate from the top gap |
| [mem 0xe0000000-0xf7ffffff] first. |
| |
| We can't do this across the board because iomem_resource.end is initialized |
| to 0xffffffff_ffffffff on 64-bit architectures, and most machines can't |
| address the entire 64-bit physical address space. Therefore, we only |
| allocate top-down if the arch requests it by clearing |
| "resource_alloc_from_bottom". |
| |
| Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> |
| |
| |
| Documentation/kernel-parameters.txt | 5 ++ |
| include/linux/ioport.h | 1 |
| kernel/resource.c | 89 +++++++++++++++++++++++++++++++++++ |
| 3 files changed, 94 insertions(+), 1 deletions(-) |
| |
| |
| diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt |
| index 8dd7248..fe50cbd 100644 |
| |
| |
| @@ -2156,6 +2156,11 @@ and is between 256 and 4096 characters. It is defined in the file |
| reset_devices [KNL] Force drivers to reset the underlying device |
| during initialization. |
| |
| + resource_alloc_from_bottom |
| + Allocate new resources from the beginning of available |
| + space, not the end. If you need to use this, please |
| + report a bug. |
| + |
| resume= [SWSUSP] |
| Specify the partition device for software suspend |
| |
| diff --git a/include/linux/ioport.h b/include/linux/ioport.h |
| index b227902..d377ea8 100644 |
| |
| |
| @@ -112,6 +112,7 @@ struct resource_list { |
| /* PC/ISA/whatever - the normal PC address spaces: IO and memory */ |
| extern struct resource ioport_resource; |
| extern struct resource iomem_resource; |
| +extern int resource_alloc_from_bottom; |
| |
| extern struct resource *request_resource_conflict(struct resource *root, struct resource *new); |
| extern int request_resource(struct resource *root, struct resource *new); |
| diff --git a/kernel/resource.c b/kernel/resource.c |
| index ace2269..8d337a9 100644 |
| |
| |
| @@ -40,6 +40,23 @@ EXPORT_SYMBOL(iomem_resource); |
| |
| static DEFINE_RWLOCK(resource_lock); |
| |
| +/* |
| + * By default, we allocate free space bottom-up. The architecture can request |
| + * top-down by clearing this flag. The user can override the architecture's |
| + * choice with the "resource_alloc_from_bottom" kernel boot option, but that |
| + * should only be a debugging tool. |
| + */ |
| +int resource_alloc_from_bottom = 1; |
| + |
| +static __init int setup_alloc_from_bottom(char *s) |
| +{ |
| + printk(KERN_INFO |
| + "resource: allocating from bottom-up; please report a bug\n"); |
| + resource_alloc_from_bottom = 1; |
| + return 0; |
| +} |
| +early_param("resource_alloc_from_bottom", setup_alloc_from_bottom); |
| + |
| static void *r_next(struct seq_file *m, void *v, loff_t *pos) |
| { |
| struct resource *p = v; |
| @@ -358,7 +375,74 @@ int __weak page_is_ram(unsigned long pfn) |
| } |
| |
| /* |
| + * Find the resource before "child" in the sibling list of "root" children. |
| + */ |
| +static struct resource *find_sibling_prev(struct resource *root, struct resource *child) |
| +{ |
| + struct resource *this; |
| + |
| + for (this = root->child; this; this = this->sibling) |
| + if (this->sibling == child) |
| + return this; |
| + |
| + return NULL; |
| +} |
| + |
| +/* |
| + * Find empty slot in the resource tree given range and alignment. |
| + * This version allocates from the end of the root resource first. |
| + */ |
| +static int find_resource_from_top(struct resource *root, struct resource *new, |
| + resource_size_t size, resource_size_t min, |
| + resource_size_t max, resource_size_t align, |
| + resource_size_t (*alignf)(void *, |
| + const struct resource *, |
| + resource_size_t, |
| + resource_size_t), |
| + void *alignf_data) |
| +{ |
| + struct resource *this; |
| + struct resource tmp = *new; |
| + resource_size_t start; |
| + |
| + tmp.start = root->end; |
| + tmp.end = root->end; |
| + |
| + this = find_sibling_prev(root, NULL); |
| + for (;;) { |
| + if (this) { |
| + if (this->end < root->end) |
| + tmp.start = this->end + 1; |
| + } else |
| + tmp.start = root->start; |
| + if (tmp.start < min) |
| + tmp.start = min; |
| + if (tmp.end > max) |
| + tmp.end = max; |
| + tmp.start = ALIGN(tmp.start, align); |
| + if (alignf) { |
| + start = alignf(alignf_data, &tmp, size, align); |
| + if (tmp.start <= start && start <= tmp.end) |
| + tmp.start = start; |
| + else |
| + tmp.start = tmp.end; |
| + } |
| + if (tmp.start < tmp.end && tmp.end - tmp.start >= size - 1) { |
| + new->start = tmp.start; |
| + new->end = tmp.start + size - 1; |
| + return 0; |
| + } |
| + if (!this || this->start == root->start) |
| + break; |
| + tmp.end = this->start - 1; |
| + this = find_sibling_prev(root, this); |
| + } |
| + return -EBUSY; |
| +} |
| + |
| +/* |
| * Find empty slot in the resource tree given range and alignment. |
| + * This version allocates from the beginning of the root resource first. |
| */ |
| static int find_resource(struct resource *root, struct resource *new, |
| resource_size_t size, resource_size_t min, |
| @@ -435,7 +519,10 @@ int allocate_resource(struct resource *root, struct resource *new, |
| int err; |
| |
| write_lock(&resource_lock); |
| - err = find_resource(root, new, size, min, max, align, alignf, alignf_data); |
| + if (resource_alloc_from_bottom) |
| + err = find_resource(root, new, size, min, max, align, alignf, alignf_data); |
| + else |
| + err = find_resource_from_top(root, new, size, min, max, align, alignf, alignf_data); |
| if (err >= 0 && __request_resource(root, new)) |
| err = -EBUSY; |
| write_unlock(&resource_lock); |
| |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |
| Allocate space from the highest-address PCI bus resource first, then work |
| downward. |
| |
| Previously, we looked for space in PCI host bridge windows in the order |
| we discovered the windows. For example, given the following windows |
| (discovered via an ACPI _CRS method): |
| |
| pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff] |
| pci_root PNP0A03:00: host bridge window [mem 0x000c0000-0x000effff] |
| pci_root PNP0A03:00: host bridge window [mem 0x000f0000-0x000fffff] |
| pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xf7ffffff] |
| pci_root PNP0A03:00: host bridge window [mem 0xff980000-0xff980fff] |
| pci_root PNP0A03:00: host bridge window [mem 0xff97c000-0xff97ffff] |
| pci_root PNP0A03:00: host bridge window [mem 0xfed20000-0xfed9ffff] |
| |
| we attempted to allocate from [mem 0x000a0000-0x000bffff] first, then |
| [mem 0x000c0000-0x000effff], and so on. |
| |
| With this patch, we allocate from [mem 0xff980000-0xff980fff] first, then |
| [mem 0xff97c000-0xff97ffff], [mem 0xfed20000-0xfed9ffff], etc. |
| |
| Allocating top-down follows Windows practice, so we're less likely to |
| trip over BIOS defects in the _CRS description. |
| |
| On the machine above (a Dell T3500), the [mem 0xbff00000-0xbfffffff] region |
| doesn't actually work and is likely a BIOS defect. The symptom is that we |
| move the AHCI controller to 0xbff00000, which leads to "Boot has failed, |
| sleeping forever," a BUG in ahci_stop_engine(), or some other boot failure. |
| |
| Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228#c43 |
| Reference: https://bugzilla.redhat.com/show_bug.cgi?id=620313 |
| Reference: https://bugzilla.redhat.com/show_bug.cgi?id=629933 |
| Reported-by: Brian Bloniarz <phunge0@hotmail.com> |
| Reported-and-tested-by: Stefan Becker <chemobejk@gmail.com> |
| Reported-by: Denys Vlasenko <dvlasenk@redhat.com> |
| Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> |
| |
| |
| drivers/pci/bus.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++----- |
| 1 files changed, 48 insertions(+), 5 deletions(-) |
| |
| |
| diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c |
| index 7f0af0e..172bf26 100644 |
| |
| |
| @@ -64,6 +64,49 @@ void pci_bus_remove_resources(struct pci_bus *bus) |
| } |
| } |
| |
| +/* |
| + * Find the highest-address bus resource below the cursor "res". If the |
| + * cursor is NULL, return the highest resource. |
| + */ |
| +static struct resource *pci_bus_find_resource_prev(struct pci_bus *bus, |
| + unsigned int type, |
| + struct resource *res) |
| +{ |
| + struct resource *r, *prev = NULL; |
| + int i; |
| + |
| + pci_bus_for_each_resource(bus, r, i) { |
| + if (!r) |
| + continue; |
| + |
| + if ((r->flags & IORESOURCE_TYPE_BITS) != type) |
| + continue; |
| + |
| + /* If this resource is at or past the cursor, skip it */ |
| + if (res) { |
| + if (r == res) |
| + continue; |
| + if (r->end > res->end) |
| + continue; |
| + if (r->end == res->end && r->start > res->start) |
| + continue; |
| + } |
| + |
| + if (!prev) |
| + prev = r; |
| + |
| + /* |
| + * A small resource is higher than a large one that ends at |
| + * the same address. |
| + */ |
| + if (r->end > prev->end || |
| + (r->end == prev->end && r->start > prev->start)) |
| + prev = r; |
| + } |
| + |
| + return prev; |
| +} |
| + |
| /** |
| * pci_bus_alloc_resource - allocate a resource from a parent bus |
| * @bus: PCI bus |
| @@ -89,9 +132,10 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res, |
| resource_size_t), |
| void *alignf_data) |
| { |
| - int i, ret = -ENOMEM; |
| + int ret = -ENOMEM; |
| struct resource *r; |
| resource_size_t max = -1; |
| + unsigned int type = res->flags & IORESOURCE_TYPE_BITS; |
| |
| type_mask |= IORESOURCE_IO | IORESOURCE_MEM; |
| |
| @@ -99,10 +143,9 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res, |
| if (!(res->flags & IORESOURCE_MEM_64)) |
| max = PCIBIOS_MAX_MEM_32; |
| |
| - pci_bus_for_each_resource(bus, r, i) { |
| - if (!r) |
| - continue; |
| - |
| + /* Look for space at highest addresses first */ |
| + r = pci_bus_find_resource_prev(bus, type, NULL); |
| + for ( ; r; r = pci_bus_find_resource_prev(bus, type, r)) { |
| /* type_mask must match */ |
| if ((res->flags ^ r->flags) & type_mask) |
| continue; |
| |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |
| Allocate from the end of a region, not the beginning. |
| |
| For example, if we need to allocate 0x800 bytes for a device on bus |
| 0000:00 given these resources: |
| |
| [mem 0xbff00000-0xdfffffff] PCI Bus 0000:00 |
| [mem 0xc0000000-0xdfffffff] PCI Bus 0000:02 |
| |
| the available space at [mem 0xbff00000-0xbfffffff] is passed to the |
| alignment callback (pcibios_align_resource()). Prior to this patch, we |
| would put the new 0x800 byte resource at the beginning of that available |
| space, i.e., at [mem 0xbff00000-0xbff007ff]. |
| |
| With this patch, we put it at the end, at [mem 0xbffff800-0xbfffffff]. |
| |
| Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228#c41 |
| Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> |
| |
| |
| arch/x86/pci/i386.c | 17 +++++++++++------ |
| 1 files changed, 11 insertions(+), 6 deletions(-) |
| |
| |
| diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c |
| index 5525309..826140a 100644 |
| |
| |
| @@ -65,16 +65,21 @@ pcibios_align_resource(void *data, const struct resource *res, |
| resource_size_t size, resource_size_t align) |
| { |
| struct pci_dev *dev = data; |
| - resource_size_t start = res->start; |
| + resource_size_t start = round_down(res->end - size + 1, align); |
| |
| if (res->flags & IORESOURCE_IO) { |
| - if (skip_isa_ioresource_align(dev)) |
| - return start; |
| - if (start & 0x300) |
| - start = (start + 0x3ff) & ~0x3ff; |
| + |
| + /* |
| + * If we're avoiding ISA aliases, the largest contiguous I/O |
| + * port space is 256 bytes. Clearing bits 9 and 10 preserves |
| + * all 256-byte and smaller alignments, so the result will |
| + * still be correctly aligned. |
| + */ |
| + if (!skip_isa_ioresource_align(dev)) |
| + start &= ~0x300; |
| } else if (res->flags & IORESOURCE_MEM) { |
| if (start < BIOS_END) |
| - start = BIOS_END; |
| + start = res->end; /* fail; no space */ |
| } |
| return start; |
| } |
| |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |
| The iomem_resource map reflects the available physical address space. |
| We statically initialize the end to -1, i.e., 0xffffffff_ffffffff, but |
| of course we can only use as much as the CPU can address. |
| |
| This patch updates the end based on the CPU capabilities, so we don't |
| mistakenly allocate space that isn't usable, as we're likely to do when |
| allocating from the top-down. |
| |
| Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> |
| |
| |
| arch/x86/kernel/setup.c | 1 + |
| 1 files changed, 1 insertions(+), 0 deletions(-) |
| |
| |
| diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c |
| index c3a4fbb..922b5a1 100644 |
| |
| |
| @@ -788,6 +788,7 @@ void __init setup_arch(char **cmdline_p) |
| |
| x86_init.oem.arch_setup(); |
| |
| + iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1; |
| setup_memory_map(); |
| parse_setup_data(); |
| /* update the e820_saved too */ |
| |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |
| Request that allocate_resource() use available space from high addresses |
| first, rather than the default of using low addresses first. |
| |
| The most common place this makes a difference is when we move or assign |
| new PCI device resources. Low addresses are generally scarce, so it's |
| better to use high addresses when possible. This follows Windows practice |
| for PCI allocation. |
| |
| Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228#c42 |
| Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> |
| |
| |
| arch/x86/kernel/setup.c | 1 + |
| 1 files changed, 1 insertions(+), 0 deletions(-) |
| |
| |
| diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c |
| index 922b5a1..0fe76df 100644 |
| |
| |
| @@ -788,6 +788,7 @@ void __init setup_arch(char **cmdline_p) |
| |
| x86_init.oem.arch_setup(); |
| |
| + resource_alloc_from_bottom = 0; |
| iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1; |
| setup_memory_map(); |
| parse_setup_data(); |
| |
| -- |
| To unsubscribe from this list: send the line "unsubscribe linux-pci" in |
| the body of a message to majordomo@vger.kernel.org |
| More majordomo info at http://vger.kernel.org/majordomo-info.html |