From 2212bbf4813500613ad51befef27374c7728acc3 Mon Sep 17 00:00:00 2001 From: Packit Service Date: Dec 09 2020 17:29:43 +0000 Subject: irqbalance-1.4.0 base --- diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..c87f0c8 --- /dev/null +++ b/.gitignore @@ -0,0 +1,36 @@ +*.a +*.o +*~ + +.deps + +*.diff +*.patch +*.orig +*.rej + +/INSTALL +/Makefile +/Makefile.in +*/Makefile +*/Makefile.in +/aclocal.m4 +/autom4te.cache +/compile +/config.guess +/config.h +/config.h.in +/config.log +/config.status +/config.sub +/configure +/install-sh +/libtool +/ltmain.sh +/m4 +/missing +/stamp-h1 +/depcomp + +/irqbalance +irqbalance-*.tar.* diff --git a/.travis.yml b/.travis.yml new file mode 100644 index 0000000..a434cbd --- /dev/null +++ b/.travis.yml @@ -0,0 +1,11 @@ +language: c +dist: trusty + +compiler: + - clang + - gcc + +script: ./autogen.sh && ./configure && make && make check + +after_script: cat ./tests/runoneshot.sh.log + diff --git a/AUTHORS b/AUTHORS new file mode 100644 index 0000000..3cbb8a0 --- /dev/null +++ b/AUTHORS @@ -0,0 +1,3 @@ +Arjen Van De Ven +Neil Horman + diff --git a/COPYING b/COPYING new file mode 100644 index 0000000..d60c31a --- /dev/null +++ b/COPYING @@ -0,0 +1,340 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. 
(Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. 
To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. 
You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. 
+ +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. 
However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. 
+You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. 
diff --git a/Makefile.am b/Makefile.am new file mode 100644 index 0000000..abf1e8d --- /dev/null +++ b/Makefile.am @@ -0,0 +1,55 @@ +# Makefile.am -- +# Copyright 2009 Red Hat Inc., Durham, North Carolina. +# All Rights Reserved. +# +# This library is free software; you can redistribute it and/or +# modify it under the terms of the GNU Lesser General Public +# License as published by the Free Software Foundation; either +# version 2.1 of the License, or (at your option) any later version. +# +# This library is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# Lesser General Public License for more details. +# +# You should have received a copy of the GNU Lesser General Public +# License along with this library; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# Authors: +# Steve Grubb +# + +AUTOMAKE_OPTIONS = no-dependencies +ACLOCAL_AMFLAGS = -I m4 +EXTRA_DIST = COPYING autogen.sh misc/irqbalance.service misc/irqbalance.env + +SUBDIRS = tests + +UI_DIR = ui +AM_CFLAGS = $(LIBCAP_NG_CFLAGS) $(GLIB2_CFLAGS) +AM_CPPFLAGS = -I${top_srcdir} -W -Wall -Wshadow -Wformat -Wundef -D_GNU_SOURCE +noinst_HEADERS = bitmap.h constants.h cpumask.h irqbalance.h non-atomic.h \ + types.h $(UI_DIR)/helpers.h $(UI_DIR)/irqbalance-ui.h $(UI_DIR)/ui.h +sbin_PROGRAMS = irqbalance + +if IRQBALANCEUI +sbin_PROGRAMS += irqbalance-ui +endif + +irqbalance_SOURCES = activate.c bitmap.c classify.c cputree.c irqbalance.c \ + irqlist.c numa.c placement.c procinterrupts.c +irqbalance_LDADD = $(LIBCAP_NG_LIBS) $(GLIB2_LIBS) +if IRQBALANCEUI +irqbalance_ui_SOURCES = $(UI_DIR)/helpers.c $(UI_DIR)/irqbalance-ui.c \ + $(UI_DIR)/ui.c +irqbalance_ui_LDADD = $(GLIB2_LIBS) $(CURSES_LIBS) +endif + +dist_man_MANS = irqbalance.1 + +CONFIG_CLEAN_FILES = debug*.list config/* +clean-generic: + rm -rf autom4te*.cache 
+ rm -f *.rej *.orig *~ + diff --git a/README.md b/README.md new file mode 100644 index 0000000..8e394bd --- /dev/null +++ b/README.md @@ -0,0 +1,40 @@ +What is Irqbalance +================== + +Irqbalance is a daemon to help balance the CPU load generated by interrupts +across all of a system's CPUs. Irqbalance identifies the highest volume +interrupt sources, and isolates them to a single unique CPU, so that load is +spread as much as possible over an entire processor set, while minimizing cache +miss rates for IRQ handlers. + +## Building and Installing [![Build Status](https://travis-ci.org/Irqbalance/irqbalance.svg?branch=master)](https://travis-ci.org/Irqbalance/irqbalance) + +```bash +./autogen.sh +./configure [options] +make +make install +``` + +## Developing Irqbalance + +Irqbalance is currently hosted on GitHub, and so developers are welcome to use +the issue/pull request/etc infrastructure found there. However, most +development discussions take place on the irqbalance mailing list, which can be +subscribed to at: +http://lists.infradead.org/mailman/listinfo/irqbalance + +New developers are encouraged to use this mailing list to discuss ideas and +propose patches. + +## Bug reporting + +When something goes wrong, feel free to send us a bug report by one of the ways +described above. Your report should include: + +* Irqbalance version you've been using (or commit hash) +* `/proc/interrupts` output +* `irqbalance --debug` output +* content of smp_affinity files - can be obtained by e.g.: + `$ for i in $(seq 0 300); do grep . /proc/irq/$i/smp_affinity /dev/null 2>/dev/null; done` +* your hw hierarchy - e.g.
`lstopo-no-graphics` output diff --git a/activate.c b/activate.c new file mode 100644 index 0000000..8fd3dd0 --- /dev/null +++ b/activate.c @@ -0,0 +1,100 @@ +/* + * Copyright (C) 2006, Intel Corporation + * Copyright (C) 2012, Neil Horman + * + * This file is part of irqbalance + * + * This program file is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; version 2 of the License. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License + * for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program in a file named COPYING; if not, write to the + * Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, + * Boston, MA 02110-1301 USA + */ + +/* + * This file contains the code to communicate a selected distribution / mapping + * of interrupts to the kernel. 
+ */ +#include "config.h" +#include +#include +#include +#include + +#include "irqbalance.h" + +static int check_affinity(struct irq_info *info, cpumask_t applied_mask) +{ + cpumask_t current_mask; + char buf[PATH_MAX]; + char *line = NULL; + size_t size = 0; + FILE *file; + + sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq); + file = fopen(buf, "r"); + if (!file) + return 1; + if (getline(&line, &size, file) <= 0) { + free(line); + fclose(file); + return 1; + } + cpumask_parse_user(line, strlen(line), current_mask); + fclose(file); + free(line); + + return cpus_equal(applied_mask, current_mask); +} + +static void activate_mapping(struct irq_info *info, void *data __attribute__((unused))) +{ + char buf[PATH_MAX]; + FILE *file; + cpumask_t applied_mask; + int valid_mask = 0; + + /* + * only activate mappings for irqs that have moved + */ + if (!info->moved) + return; + + if (info->assigned_obj) { + applied_mask = info->assigned_obj->mask; + valid_mask = 1; + } + + /* + * Don't activate anything for which we have an invalid mask + */ + if (!valid_mask || check_affinity(info, applied_mask)) + return; + + if (!info->assigned_obj) + return; + + sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq); + file = fopen(buf, "w"); + if (!file) + return; + + cpumask_scnprintf(buf, PATH_MAX, applied_mask); + fprintf(file, "%s", buf); + fclose(file); + info->moved = 0; /*migration is done*/ +} + +void activate_mappings(void) +{ + for_each_irq(NULL, activate_mapping, NULL); +} diff --git a/autogen.sh b/autogen.sh new file mode 100755 index 0000000..b792e8b --- /dev/null +++ b/autogen.sh @@ -0,0 +1,5 @@ +#!
/bin/sh +set -x -e +mkdir -p m4 +# --no-recursive is available only in recent autoconf versions +autoreconf -fv --install diff --git a/bitmap.c b/bitmap.c new file mode 100644 index 0000000..6a7421a --- /dev/null +++ b/bitmap.c @@ -0,0 +1,463 @@ +/* + +This file is taken from the Linux kernel and minimally adapted for use in userspace + +*/ + +/* + * lib/bitmap.c + * Helper functions for bitmap.h. + * + * This source code is licensed under the GNU General Public License, + * Version 2. See the file COPYING for more details. + */ +#include "config.h" +#include +#include +#include +#include +#include +#include "bitmap.h" +#include "non-atomic.h" + +/* + * bitmaps provide an array of bits, implemented using an + * array of unsigned longs. The number of valid bits in a + * given bitmap does _not_ need to be an exact multiple of + * BITS_PER_LONG. + * + * The possible unused bits in the last, partially used word + * of a bitmap are 'don't care'. The implementation makes + * no particular effort to keep them zero. It ensures that + * their value will not affect the results of any operation. + * The bitmap operations that return Boolean (bitmap_empty, + * for example) or scalar (bitmap_weight, for example) results + * carefully filter out these unused bits from impacting their + * results. + * + * These operations actually hold to a slightly stronger rule: + * if you don't input any bitmaps to these ops that have some + * unused bits set, then they won't output any set unused bits + * in output bitmaps. + * + * The byte ordering of bitmaps is more natural on little + * endian architectures. See the big-endian headers + * include/asm-ppc64/bitops.h and include/asm-s390/bitops.h + * for the best explanations of this ordering.
+ */ + +int __bitmap_empty(const unsigned long *bitmap, int bits) +{ + int k, lim = bits/BITS_PER_LONG; + for (k = 0; k < lim; ++k) + if (bitmap[k]) + return 0; + + if (bits % BITS_PER_LONG) + if (bitmap[k] & BITMAP_LAST_WORD_MASK(bits)) + return 0; + + return 1; +} + +int __bitmap_full(const unsigned long *bitmap, int bits) +{ + int k, lim = bits/BITS_PER_LONG; + for (k = 0; k < lim; ++k) + if (~bitmap[k]) + return 0; + + if (bits % BITS_PER_LONG) + if (~bitmap[k] & BITMAP_LAST_WORD_MASK(bits)) + return 0; + + return 1; +} + +int __bitmap_weight(const unsigned long *bitmap, int bits) +{ + int k, w = 0, lim = bits/BITS_PER_LONG; + + for (k = 0; k < lim; k++) + w += hweight_long(bitmap[k]); + + if (bits % BITS_PER_LONG) + w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits)); + + return w; +} + +int __bitmap_equal(const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits) +{ + int k, lim = bits/BITS_PER_LONG; + for (k = 0; k < lim; ++k) + if (bitmap1[k] != bitmap2[k]) + return 0; + + if (bits % BITS_PER_LONG) + if ((bitmap1[k] ^ bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits)) + return 0; + + return 1; +} + +void __bitmap_complement(unsigned long *dst, const unsigned long *src, int bits) +{ + int k, lim = bits/BITS_PER_LONG; + for (k = 0; k < lim; ++k) + dst[k] = ~src[k]; + + if (bits % BITS_PER_LONG) + dst[k] = ~src[k] & BITMAP_LAST_WORD_MASK(bits); +} + +/* + * __bitmap_shift_right - logical right shift of the bits in a bitmap + * @dst - destination bitmap + * @src - source bitmap + * @nbits - shift by this many bits + * @bits - bitmap size, in bits + * + * Shifting right (dividing) means moving bits in the MS -> LS bit + * direction. Zeros are fed into the vacated MS positions and the + * LS bits shifted off the bottom are lost. 
+ */ +void __bitmap_shift_right(unsigned long *dst, + const unsigned long *src, int shift, int bits) +{ + int k, lim = BITS_TO_LONGS(bits), left = bits % BITS_PER_LONG; + int off = shift/BITS_PER_LONG, rem = shift % BITS_PER_LONG; + unsigned long mask = (1UL << left) - 1; + for (k = 0; off + k < lim; ++k) { + unsigned long upper, lower; + + /* + * If shift is not word aligned, take lower rem bits of + * word above and make them the top rem bits of result. + */ + if (!rem || off + k + 1 >= lim) + upper = 0; + else { + upper = src[off + k + 1]; + if (off + k + 1 == lim - 1 && left) + upper &= mask; + } + lower = src[off + k]; + if (left && off + k == lim - 1) + lower &= mask; + dst[k] = upper << (BITS_PER_LONG - rem) | lower >> rem; + if (left && k == lim - 1) + dst[k] &= mask; + } + if (off) + memset(&dst[lim - off], 0, off*sizeof(unsigned long)); +} + + +/* + * __bitmap_shift_left - logical left shift of the bits in a bitmap + * @dst - destination bitmap + * @src - source bitmap + * @nbits - shift by this many bits + * @bits - bitmap size, in bits + * + * Shifting left (multiplying) means moving bits in the LS -> MS + * direction. Zeros are fed into the vacated LS bit positions + * and those MS bits shifted off the top are lost. + */ + +void __bitmap_shift_left(unsigned long *dst, + const unsigned long *src, int shift, int bits) +{ + int k, lim = BITS_TO_LONGS(bits), left = bits % BITS_PER_LONG; + int off = shift/BITS_PER_LONG, rem = shift % BITS_PER_LONG; + for (k = lim - off - 1; k >= 0; --k) { + unsigned long upper, lower; + + /* + * If shift is not word aligned, take upper rem bits of + * word below and make them the bottom rem bits of result. 
+ */ + if (rem && k > 0) + lower = src[k - 1]; + else + lower = 0; + upper = src[k]; + if (left && k == lim - 1) + upper &= (1UL << left) - 1; + dst[k + off] = lower >> (BITS_PER_LONG - rem) | upper << rem; + if (left && k + off == lim - 1) + dst[k + off] &= (1UL << left) - 1; + } + if (off) + memset(dst, 0, off*sizeof(unsigned long)); +} + +void __bitmap_and(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits) +{ + int k; + int nr = BITS_TO_LONGS(bits); + + for (k = 0; k < nr; k++) + dst[k] = bitmap1[k] & bitmap2[k]; +} + +void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits) +{ + int k; + int nr = BITS_TO_LONGS(bits); + + for (k = 0; k < nr; k++) + dst[k] = bitmap1[k] | bitmap2[k]; +} + +void __bitmap_xor(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits) +{ + int k; + int nr = BITS_TO_LONGS(bits); + + for (k = 0; k < nr; k++) + dst[k] = bitmap1[k] ^ bitmap2[k]; +} + +void __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits) +{ + int k; + int nr = BITS_TO_LONGS(bits); + + for (k = 0; k < nr; k++) + dst[k] = bitmap1[k] & ~bitmap2[k]; +} + +int __bitmap_intersects(const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits) +{ + int k, lim = bits/BITS_PER_LONG; + for (k = 0; k < lim; ++k) + if (bitmap1[k] & bitmap2[k]) + return 1; + + if (bits % BITS_PER_LONG) + if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits)) + return 1; + return 0; +} + +/* + * Bitmap printing & parsing functions: first version by Bill Irwin, + * second version by Paul Jackson, third by Joe Korty. + */ + +#define CHUNKSZ 32 +#define nbits_to_hold_value(val) fls(val) +#define unhex(c) (isdigit(c) ? (c - '0') : (toupper(c) - 'A' + 10)) +#define BASEDEC 10 /* fancier cpuset lists input in decimal */ + +/** + * bitmap_scnprintf - convert bitmap to an ASCII hex string. 
+ * @buf: byte buffer into which string is placed + * @buflen: reserved size of @buf, in bytes + * @maskp: pointer to bitmap to convert + * @nmaskbits: size of bitmap, in bits + * + * Exactly @nmaskbits bits are displayed. Hex digits are grouped into + * comma-separated sets of eight digits per set. + */ +int bitmap_scnprintf(char *buf, unsigned int buflen, + const unsigned long *maskp, int nmaskbits) +{ + int i, word, bit, len = 0; + unsigned long val; + const char *sep = ""; + int chunksz; + uint32_t chunkmask; + int first = 1; + + chunksz = nmaskbits & (CHUNKSZ - 1); + if (chunksz == 0) + chunksz = CHUNKSZ; + + i = ALIGN(nmaskbits, CHUNKSZ) - CHUNKSZ; + for (; i >= 0; i -= CHUNKSZ) { + chunkmask = ((1ULL << chunksz) - 1); + word = i / BITS_PER_LONG; + bit = i % BITS_PER_LONG; + val = (maskp[word] >> bit) & chunkmask; + if (val!=0 || !first || i==0) { + len += snprintf(buf+len, buflen-len, "%s%0*lx", sep, + (chunksz+3)/4, val); + sep = ","; + first = 0; + } + chunksz = CHUNKSZ; + } + return len; +} + +/** + * __bitmap_parse - convert an ASCII hex string into a bitmap. + * @buf: pointer to buffer containing string. + * @buflen: buffer size in bytes. If string is smaller than this + * then it must be terminated with a \0. + * @is_user: location of buffer, 0 indicates kernel space + * @maskp: pointer to bitmap array that will contain result. + * @nmaskbits: size of bitmap, in bits. + * + * Commas group hex digits into chunks. Each chunk defines exactly 32 + * bits of the resultant bitmask. No chunk may specify a value larger + * than 32 bits (%-EOVERFLOW), and if a chunk specifies a smaller value + * then leading 0-bits are prepended. %-EINVAL is returned for illegal + * characters and for grouping errors such as "1,,5", ",44", "," and "". + * Leading and trailing whitespace accepted, but not embedded whitespace. 
+ */ +int __bitmap_parse(const char *buf, unsigned int buflen, + int is_user __attribute((unused)), unsigned long *maskp, + int nmaskbits) +{ + int c, old_c, totaldigits, ndigits, nchunks, nbits; + uint32_t chunk; + + bitmap_zero(maskp, nmaskbits); + + nchunks = nbits = totaldigits = c = 0; + do { + chunk = ndigits = 0; + + /* Get the next chunk of the bitmap */ + while (buflen) { + old_c = c; + c = *buf++; + buflen--; + if (isspace(c)) + continue; + + /* + * If the last character was a space and the current + * character isn't '\0', we've got embedded whitespace. + * This is a no-no, so throw an error. + */ + if (totaldigits && c && isspace(old_c)) + return -EINVAL; + + /* A '\0' or a ',' signal the end of the chunk */ + if (c == '\0' || c == ',') + break; + + if (!isxdigit(c)) + return -EINVAL; + + /* + * Make sure there are at least 4 free bits in 'chunk'. + * If not, this hexdigit will overflow 'chunk', so + * throw an error. + */ + if (chunk & ~((1UL << (CHUNKSZ - 4)) - 1)) + return -EOVERFLOW; + + chunk = (chunk << 4) | unhex(c); + ndigits++; totaldigits++; + } + if (ndigits == 0) + return -EINVAL; + if (nchunks == 0 && chunk == 0) + continue; + + __bitmap_shift_left(maskp, maskp, CHUNKSZ, nmaskbits); + *maskp |= chunk; + nchunks++; + nbits += (nchunks == 1) ? nbits_to_hold_value(chunk) : CHUNKSZ; + if (nbits > nmaskbits) + return -EOVERFLOW; + } while (buflen && c == ','); + + return 0; +} + +/** + * __bitmap_parselist - convert list format ASCII string to bitmap + * @buf: read nul-terminated user string from this buffer + * @buflen: buffer size in bytes. If string is smaller than this + * then it must be terminated with a \0. + * @is_user: location of buffer, 0 indicates kernel space + * @maskp: write resulting mask here + * @nmaskbits: number of bits in mask to be written + * + * Input format is a comma-separated list of decimal numbers and + * ranges.
Consecutively set bits are shown as two hyphen-separated + * decimal numbers, the smallest and largest bit numbers set in + * the range. + * + * Returns 0 on success, -errno on invalid input strings. + * Error values: + * %-EINVAL: second number in range smaller than first + * %-EINVAL: invalid character in string + * %-ERANGE: bit number specified too large for mask + */ +int __bitmap_parselist(const char *buf, unsigned int buflen, + int is_user __attribute((unused)), unsigned long *maskp, + int nmaskbits) +{ + int a, b, c, old_c, totaldigits; + int exp_digit, in_range; + + totaldigits = c = 0; + bitmap_zero(maskp, nmaskbits); + do { + exp_digit = 1; + in_range = 0; + a = b = 0; + + /* Get the next cpu# or a range of cpu#'s */ + while (buflen) { + old_c = c; + c = *buf++; + buflen--; + if (isspace(c)) + continue; + + /* + * If the last character was a space and the current + * character isn't '\0', we've got embedded whitespace. + * This is a no-no, so throw an error. + */ + if (totaldigits && c && isspace(old_c)) + return -EINVAL; + + /* A '\0' or a ',' signal the end of a cpu# or range */ + if (c == '\0' || c == ',') + break; + + if (c == '-') { + if (exp_digit || in_range) + return -EINVAL; + b = 0; + in_range = 1; + exp_digit = 1; + continue; + } + + if (!isdigit(c)) + return -EINVAL; + + b = b * 10 + (c - '0'); + if (!in_range) + a = b; + exp_digit = 0; + totaldigits++; + } + if (!(a <= b)) + return -EINVAL; + if (b >= nmaskbits) + return -ERANGE; + while (a <= b) { + set_bit(a, maskp); + a++; + } + } while (buflen && c == ','); + return 0; +} diff --git a/bitmap.h b/bitmap.h new file mode 100644 index 0000000..7afce59 --- /dev/null +++ b/bitmap.h @@ -0,0 +1,362 @@ +#ifndef __LINUX_BITMAP_H +#define __LINUX_BITMAP_H + +#ifndef __ASSEMBLY__ + +#include +#include +#include + + +#define BITS_PER_LONG ((int)sizeof(unsigned long)*8) + +#define BITS_TO_LONGS(bits) \ + (((bits)+BITS_PER_LONG-1)/BITS_PER_LONG) +#define DECLARE_BITMAP(name,bits) \ + unsigned long 
name[BITS_TO_LONGS(bits)] +#define ALIGN(x,a) (((x)+(a)-1UL)&~((a)-1UL)) + + +#include "non-atomic.h" + +static inline unsigned int hweight32(unsigned int w) +{ + unsigned int res = w - ((w >> 1) & 0x55555555); + res = (res & 0x33333333) + ((res >> 2) & 0x33333333); + res = (res + (res >> 4)) & 0x0F0F0F0F; + res = res + (res >> 8); + return (res + (res >> 16)) & 0x000000FF; +} + +static inline unsigned long hweight64(uint64_t w) +{ + if (BITS_PER_LONG == 32) + return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w); + + w -= (w >> 1) & 0x5555555555555555ull; + w = (w & 0x3333333333333333ull) + ((w >> 2) & 0x3333333333333333ull); + w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0full; + return (w * 0x0101010101010101ull) >> 56; +} + + +static inline int fls(int x) +{ + int r = 32; + + if (!x) + return 0; + if (!(x & 0xffff0000u)) { + x <<= 16; + r -= 16; + } + if (!(x & 0xff000000u)) { + x <<= 8; + r -= 8; + } + if (!(x & 0xf0000000u)) { + x <<= 4; + r -= 4; + } + if (!(x & 0xc0000000u)) { + x <<= 2; + r -= 2; + } + if (!(x & 0x80000000u)) { + x <<= 1; + r -= 1; + } + return r; +} + +static inline unsigned long hweight_long(unsigned long w) +{ + return sizeof(w) == 4 ? hweight32(w) : hweight64(w); +} + +#define min(x,y) ({ \ + typeof(x) _x = (x); \ + typeof(y) _y = (y); \ + (void) (&_x == &_y); \ + _x < _y ? _x : _y; }) + + +/* + * bitmaps provide bit arrays that consume one or more unsigned + * longs. The bitmap interface and available operations are listed + * here, in bitmap.h + * + * Function implementations generic to all architectures are in + * lib/bitmap.c. Functions implementations that are architecture + * specific are in various include/asm-/bitops.h headers + * and other arch/ specific files. + * + * See lib/bitmap.c for more details. + */ + +/* + * The available bitmap operations and their rough meaning in the + * case that the bitmap is a single unsigned long are thus: + * + * Note that nbits should be always a compile time evaluable constant. 
+ * Otherwise many inlines will generate horrible code. + * + * bitmap_zero(dst, nbits) *dst = 0UL + * bitmap_fill(dst, nbits) *dst = ~0UL + * bitmap_copy(dst, src, nbits) *dst = *src + * bitmap_and(dst, src1, src2, nbits) *dst = *src1 & *src2 + * bitmap_or(dst, src1, src2, nbits) *dst = *src1 | *src2 + * bitmap_xor(dst, src1, src2, nbits) *dst = *src1 ^ *src2 + * bitmap_andnot(dst, src1, src2, nbits) *dst = *src1 & ~(*src2) + * bitmap_complement(dst, src, nbits) *dst = ~(*src) + * bitmap_equal(src1, src2, nbits) Are *src1 and *src2 equal? + * bitmap_intersects(src1, src2, nbits) Do *src1 and *src2 overlap? + * bitmap_subset(src1, src2, nbits) Is *src1 a subset of *src2? + * bitmap_empty(src, nbits) Are all bits zero in *src? + * bitmap_full(src, nbits) Are all bits set in *src? + * bitmap_weight(src, nbits) Hamming Weight: number set bits + * bitmap_shift_right(dst, src, n, nbits) *dst = *src >> n + * bitmap_shift_left(dst, src, n, nbits) *dst = *src << n + * bitmap_remap(dst, src, old, new, nbits) *dst = map(old, new)(src) + * bitmap_bitremap(oldbit, old, new, nbits) newbit = map(old, new)(oldbit) + * bitmap_scnprintf(buf, len, src, nbits) Print bitmap src to buf + * bitmap_parse(buf, buflen, dst, nbits) Parse bitmap dst from kernel buf + * bitmap_parse_user(ubuf, ulen, dst, nbits) Parse bitmap dst from user buf + * bitmap_scnlistprintf(buf, len, src, nbits) Print bitmap src as list to buf + * bitmap_parselist(buf, dst, nbits) Parse bitmap dst from list + * bitmap_find_free_region(bitmap, bits, order) Find and allocate bit region + * bitmap_release_region(bitmap, pos, order) Free specified bit region + * bitmap_allocate_region(bitmap, pos, order) Allocate specified bit region + */ + +/* + * Also the following operations in asm/bitops.h apply to bitmaps. + * + * set_bit(bit, addr) *addr |= bit + * clear_bit(bit, addr) *addr &= ~bit + * change_bit(bit, addr) *addr ^= bit + * test_bit(bit, addr) Is bit set in *addr? 
+ * test_and_set_bit(bit, addr) Set bit and return old value + * test_and_clear_bit(bit, addr) Clear bit and return old value + * test_and_change_bit(bit, addr) Change bit and return old value + * find_first_zero_bit(addr, nbits) Position first zero bit in *addr + * find_first_bit(addr, nbits) Position first set bit in *addr + * find_next_zero_bit(addr, nbits, bit) Position next zero bit in *addr >= bit + * find_next_bit(addr, nbits, bit) Position next set bit in *addr >= bit + */ + +/* + * The DECLARE_BITMAP(name,bits) macro, in linux/types.h, can be used + * to declare an array named 'name' of just enough unsigned longs to + * contain all bit positions from 0 to 'bits' - 1. + */ + +/* + * lib/bitmap.c provides these functions: + */ + +extern int __bitmap_empty(const unsigned long *bitmap, int bits); +extern int __bitmap_full(const unsigned long *bitmap, int bits); +extern int __bitmap_equal(const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern void __bitmap_complement(unsigned long *dst, const unsigned long *src, + int bits); +extern void __bitmap_shift_right(unsigned long *dst, + const unsigned long *src, int shift, int bits); +extern void __bitmap_shift_left(unsigned long *dst, + const unsigned long *src, int shift, int bits); +extern void __bitmap_and(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern void __bitmap_xor(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern void __bitmap_andnot(unsigned long *dst, const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern int __bitmap_intersects(const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern int __bitmap_subset(const unsigned long *bitmap1, + const unsigned long *bitmap2, int bits); +extern int __bitmap_weight(const 
unsigned long *bitmap, int bits); + +extern int bitmap_scnprintf(char *buf, unsigned int len, + const unsigned long *src, int nbits); +extern int __bitmap_parse(const char *buf, unsigned int buflen, int is_user, + unsigned long *dst, int nbits); +extern int bitmap_scnlistprintf(char *buf, unsigned int len, + const unsigned long *src, int nbits); +extern int __bitmap_parselist(const char *buf, unsigned int buflen, int is_user, + unsigned long *dst, int nbits); +extern void bitmap_remap(unsigned long *dst, const unsigned long *src, + const unsigned long *old, const unsigned long *new, int bits); +extern int bitmap_bitremap(int oldbit, + const unsigned long *old, const unsigned long *new, int bits); +extern int bitmap_find_free_region(unsigned long *bitmap, int bits, int order); +extern void bitmap_release_region(unsigned long *bitmap, int pos, int order); +extern int bitmap_allocate_region(unsigned long *bitmap, int pos, int order); + +#define BITMAP_LAST_WORD_MASK(nbits) \ +( \ + ((nbits) % BITS_PER_LONG) ? 
\ + (1UL<<((nbits) % BITS_PER_LONG))-1 : ~0UL \ +) + +static inline void bitmap_zero(unsigned long *dst, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = 0UL; + else { + int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); + memset(dst, 0, len); + } +} + +static inline void bitmap_fill(unsigned long *dst, int nbits) +{ + size_t nlongs = BITS_TO_LONGS(nbits); + if (nlongs > 1) { + int len = (nlongs - 1) * sizeof(unsigned long); + memset(dst, 0xff, len); + } + dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits); +} + +static inline void bitmap_copy(unsigned long *dst, const unsigned long *src, + int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = *src; + else { + int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); + memcpy(dst, src, len); + } +} + +static inline void bitmap_and(unsigned long *dst, const unsigned long *src1, + const unsigned long *src2, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = *src1 & *src2; + else + __bitmap_and(dst, src1, src2, nbits); +} + +static inline void bitmap_or(unsigned long *dst, const unsigned long *src1, + const unsigned long *src2, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = *src1 | *src2; + else + __bitmap_or(dst, src1, src2, nbits); +} + +static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1, + const unsigned long *src2, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = *src1 ^ *src2; + else + __bitmap_xor(dst, src1, src2, nbits); +} + +static inline void bitmap_andnot(unsigned long *dst, const unsigned long *src1, + const unsigned long *src2, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = *src1 & ~(*src2); + else + __bitmap_andnot(dst, src1, src2, nbits); +} + +static inline void bitmap_complement(unsigned long *dst, const unsigned long *src, + int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = ~(*src) & BITMAP_LAST_WORD_MASK(nbits); + else + __bitmap_complement(dst, src, nbits); +} + +static inline int bitmap_equal(const unsigned long *src1, + const unsigned long *src2, 
int nbits) +{ + if (nbits <= BITS_PER_LONG) + return ! ((*src1 ^ *src2) & BITMAP_LAST_WORD_MASK(nbits)); + else + return __bitmap_equal(src1, src2, nbits); +} + +static inline int bitmap_intersects(const unsigned long *src1, + const unsigned long *src2, int nbits) +{ + if (nbits <= BITS_PER_LONG) + return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0; + else + return __bitmap_intersects(src1, src2, nbits); +} + +static inline int bitmap_subset(const unsigned long *src1, + const unsigned long *src2, int nbits) +{ + if (nbits <= BITS_PER_LONG) + return ! ((*src1 & ~(*src2)) & BITMAP_LAST_WORD_MASK(nbits)); + else + return __bitmap_subset(src1, src2, nbits); +} + +static inline int bitmap_empty(const unsigned long *src, int nbits) +{ + if (nbits <= BITS_PER_LONG) + return ! (*src & BITMAP_LAST_WORD_MASK(nbits)); + else + return __bitmap_empty(src, nbits); +} + +static inline int bitmap_full(const unsigned long *src, int nbits) +{ + if (nbits <= BITS_PER_LONG) + return ! (~(*src) & BITMAP_LAST_WORD_MASK(nbits)); + else + return __bitmap_full(src, nbits); +} + +static inline int bitmap_weight(const unsigned long *src, int nbits) +{ + if (nbits <= BITS_PER_LONG) + return hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits)); + return __bitmap_weight(src, nbits); +} + +static inline void bitmap_shift_right(unsigned long *dst, + const unsigned long *src, int n, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = *src >> n; + else + __bitmap_shift_right(dst, src, n, nbits); +} + +static inline void bitmap_shift_left(unsigned long *dst, + const unsigned long *src, int n, int nbits) +{ + if (nbits <= BITS_PER_LONG) + *dst = (*src << n) & BITMAP_LAST_WORD_MASK(nbits); + else + __bitmap_shift_left(dst, src, n, nbits); +} + +static inline int bitmap_parse(const char *buf, unsigned int buflen, + unsigned long *maskp, int nmaskbits) +{ + return __bitmap_parse(buf, buflen, 0, maskp, nmaskbits); +} + +static inline int bitmap_parselist(const char *buf, unsigned int buflen, + 
unsigned long *maskp, int nmaskbits) +{ + return __bitmap_parselist(buf, buflen, 0, maskp, nmaskbits); +} + +#endif /* __ASSEMBLY__ */ + +#endif /* __LINUX_BITMAP_H */ diff --git a/classify.c b/classify.c new file mode 100644 index 0000000..df8a89b --- /dev/null +++ b/classify.c @@ -0,0 +1,854 @@ +#include "config.h" +#include +#include +#include +#include +#include +#include +#include + +#include "irqbalance.h" +#include "types.h" + + +char *classes[] = { + "other", + "legacy", + "storage", + "video", + "ethernet", + "gbit-ethernet", + "10gbit-ethernet", + "virt-event", + 0 +}; + +static int map_class_to_level[8] = +{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE }; + +struct user_irq_policy { + int ban; + int level; + int numa_node_set; + int numa_node; +}; + +static GList *interrupts_db = NULL; +static GList *banned_irqs = NULL; +GList *cl_banned_irqs = NULL; +static GList *cl_banned_modules = NULL; + +#define SYSFS_DIR "/sys" +#define SYSDEV_DIR "/sys/bus/pci/devices" + +#define PCI_MAX_CLASS 0x14 +#define PCI_MAX_SERIAL_SUBCLASS 0x81 + +#define PCI_INVAL_DATA 0xFFFFFFFF + +struct pci_info { + unsigned short vendor; + unsigned short device; + unsigned short sub_vendor; + unsigned short sub_device; + unsigned int class; +}; + +/* PCI vendor ID, device ID */ +#define PCI_VENDOR_PLX 0x10b5 +#define PCI_DEVICE_PLX_PEX8619 0x8619 +#define PCI_VENDOR_CAVIUM 0x177d +#define PCI_DEVICE_CAVIUM_CN61XX 0x0093 + +/* PCI subsystem vendor ID, subsystem device ID */ +#define PCI_SUB_VENDOR_EMC 0x1120 +#define PCI_SUB_DEVICE_EMC_055B 0x055b +#define PCI_SUB_DEVICE_EMC_0568 0x0568 +#define PCI_SUB_DEVICE_EMC_dd00 0xdd00 + +/* + * Apply software workarounds for some special devices + * + * The world is not perfect and supplies us with broken PCI devices. + * Usually there are two sort of cases: + * + * 1. The device is special + * Before shipping the devices, PCI spec doesn't have the definitions. + * + * 2. 
Buggy PCI devices + * Some PCI devices don't follow the PCI class code definitions. + */ +static void apply_pci_quirks(const struct pci_info *pci, int *irq_class) +{ + if ((pci->vendor == PCI_VENDOR_PLX) && + (pci->device == PCI_DEVICE_PLX_PEX8619) && + (pci->sub_vendor == PCI_SUB_VENDOR_EMC)) { + switch (pci->sub_device) { + case PCI_SUB_DEVICE_EMC_055B: + case PCI_SUB_DEVICE_EMC_dd00: + *irq_class = IRQ_SCSI; + break; + } + } + + if ((pci->vendor == PCI_VENDOR_CAVIUM) && + (pci->device == PCI_DEVICE_CAVIUM_CN61XX) && + (pci->sub_vendor == PCI_SUB_VENDOR_EMC)) { + switch (pci->sub_device) { + case PCI_SUB_DEVICE_EMC_0568: + *irq_class = IRQ_SCSI; + break; + } + } + + return; +} + +/* Determine IRQ class based on PCI class code */ +static int map_pci_irq_class(unsigned int pci_class) +{ + unsigned int major = pci_class >> 16; + unsigned int sub = (pci_class & 0xFF00) >> 8; + int irq_class = IRQ_NODEF; + /* + * Class codes are taken from the following PCI-SIG spec: + * + * PCI Code and ID Assignment Specification v1.5 + * + * and mapped to irqbalance types here. + * + * IRQ_NODEF will go through classification by PCI sub-class code. + */ + static short major_class_codes[PCI_MAX_CLASS] = { + IRQ_OTHER, + IRQ_SCSI, + IRQ_ETH, + IRQ_VIDEO, + IRQ_OTHER, + IRQ_OTHER, + IRQ_LEGACY, + IRQ_OTHER, + IRQ_OTHER, + IRQ_LEGACY, + IRQ_OTHER, + IRQ_OTHER, + IRQ_NODEF, + IRQ_ETH, + IRQ_SCSI, + IRQ_OTHER, + IRQ_OTHER, + IRQ_OTHER, + IRQ_LEGACY, + IRQ_LEGACY, + }; + + /* + * All sub-class codes for serial bus controllers. + * The major class code is 0xc. + */ + static short serial_sub_codes[PCI_MAX_SERIAL_SUBCLASS] = { + IRQ_LEGACY, + IRQ_LEGACY, + IRQ_LEGACY, + IRQ_LEGACY, + IRQ_SCSI, + IRQ_LEGACY, + IRQ_SCSI, + IRQ_LEGACY, + IRQ_LEGACY, + IRQ_LEGACY, + [0xa ...
0x7f] = IRQ_NODEF, + IRQ_LEGACY, + }; + + /* + * Check major class code first + */ + + if (major >= PCI_MAX_CLASS) + return IRQ_NODEF; + + switch (major) { + case 0xc: /* Serial bus class */ + if (sub >= PCI_MAX_SERIAL_SUBCLASS) + return IRQ_NODEF; + irq_class = serial_sub_codes[sub]; + break; + default: /* All other PCI classes */ + irq_class = major_class_codes[major]; + break; + } + + return irq_class; +} + +/* Read specific data from sysfs */ +static unsigned int read_pci_data(const char *devpath, const char* file) +{ + char path[PATH_MAX]; + FILE *fd; + unsigned int data = PCI_INVAL_DATA; + + sprintf(path, "%s/%s", devpath, file); + + fd = fopen(path, "r"); + + if (!fd) { + log(TO_CONSOLE, LOG_WARNING, "PCI: can't open file:%s\n", path); + return data; + } + + (void) fscanf(fd, "%x", &data); + fclose(fd); + + return data; +} + +/* Get pci information for IRQ classification */ +static int get_pci_info(const char *devpath, struct pci_info *pci) +{ + unsigned int data = PCI_INVAL_DATA; + + if ((data = read_pci_data(devpath, "vendor")) == PCI_INVAL_DATA) + return -ENODEV; + pci->vendor = (unsigned short)data; + + if ((data = read_pci_data(devpath, "device")) == PCI_INVAL_DATA) + return -ENODEV; + pci->device = (unsigned short)data; + + if ((data = read_pci_data(devpath, "subsystem_vendor")) == PCI_INVAL_DATA) + return -ENODEV; + pci->sub_vendor = (unsigned short)data; + + if ((data = read_pci_data(devpath, "subsystem_device")) == PCI_INVAL_DATA) + return -ENODEV; + pci->sub_device = (unsigned short)data; + + if ((data = read_pci_data(devpath, "class")) == PCI_INVAL_DATA) + return -ENODEV; + pci->class = data; + + return 0; +} + +/* Return IRQ class for given devpath */ +static int get_irq_class(const char *devpath) +{ + int irq_class = IRQ_NODEF; + struct pci_info pci; + + /* Get PCI info from sysfs */ + if (get_pci_info(devpath, &pci) < 0) + return IRQ_NODEF; + + /* Map PCI class code to irq class */ + irq_class = map_pci_irq_class(pci.class); + if (irq_class < 
0) { + log(TO_CONSOLE, LOG_WARNING, "Invalid PCI class code %d\n", + pci.class); + return IRQ_NODEF; + } + + /* Reassign irq class for some buggy devices */ + apply_pci_quirks(&pci, &irq_class); + + return irq_class; +} + +static gint compare_ints(gconstpointer a, gconstpointer b) +{ + const struct irq_info *ai = a; + const struct irq_info *bi = b; + + return ai->irq - bi->irq; +} + +static void add_banned_irq(int irq, GList **list) +{ + struct irq_info find, *new; + GList *entry; + + find.irq = irq; + entry = g_list_find_custom(*list, &find, compare_ints); + if (entry) + return; + + new = calloc(sizeof(struct irq_info), 1); + if (!new) { + log(TO_CONSOLE, LOG_WARNING, "No memory to ban irq %d\n", irq); + return; + } + + new->irq = irq; + new->flags |= IRQ_FLAG_BANNED; + + *list = g_list_append(*list, new); + log(TO_CONSOLE, LOG_INFO, "IRQ %d was BANNED.\n", irq); + return; +} + +void add_cl_banned_irq(int irq) +{ + add_banned_irq(irq, &cl_banned_irqs); +} + +static int is_banned_irq(int irq) +{ + GList *entry; + struct irq_info find; + + find.irq = irq; + + entry = g_list_find_custom(banned_irqs, &find, compare_ints); + return entry ? 1:0; +} + +gint substr_find(gconstpointer a, gconstpointer b) +{ + if (strstr(b, a)) + return 0; + else + return 1; +} + +static void add_banned_module(char *modname, GList **modlist) +{ + GList *entry; + char *newmod; + + entry = g_list_find_custom(*modlist, modname, substr_find); + if (entry) + return; + + newmod = strdup(modname); + if (!newmod) { + log(TO_CONSOLE, LOG_WARNING, "No memory to ban module %s\n", modname); + return; + } + + *modlist = g_list_append(*modlist, newmod); +} + +void add_cl_banned_module(char *modname) +{ + add_banned_module(modname, &cl_banned_modules); +} + + +/* + * Inserts an irq_info struct into the interrupts_db list + * devpath points to the device directory in sysfs for the + * related device. NULL devpath means no sysfs entries for + * this irq.
+ */ +static struct irq_info *add_one_irq_to_db(const char *devpath, int irq, struct user_irq_policy *pol) +{ + int irq_class = IRQ_OTHER; + struct irq_info *new, find; + int numa_node; + char path[PATH_MAX]; + FILE *fd; + char *lcpu_mask; + GList *entry; + ssize_t ret; + size_t blen; + + /* + * First check to make sure this isn't a duplicate entry + */ + find.irq = irq; + entry = g_list_find_custom(interrupts_db, &find, compare_ints); + if (entry) { + log(TO_CONSOLE, LOG_INFO, "DROPPING DUPLICATE ENTRY FOR IRQ %d on path %s\n", irq, devpath); + return NULL; + } + + if (is_banned_irq(irq)) { + log(TO_ALL, LOG_INFO, "SKIPPING BANNED IRQ %d\n", irq); + return NULL; + } + + new = calloc(sizeof(struct irq_info), 1); + if (!new) + return NULL; + + new->irq = irq; + new->class = IRQ_OTHER; + + interrupts_db = g_list_append(interrupts_db, new); + + /* Some special irqs have NULL devpath */ + if (devpath != NULL) { + /* Map PCI class code to irq class */ + irq_class = get_irq_class(devpath); + if (irq_class < 0) + goto get_numa_node; + } + + new->class = irq_class; + if (pol->level >= 0) + new->level = pol->level; + else + new->level = map_class_to_level[irq_class]; + +get_numa_node: + numa_node = -1; + if (numa_avail) { + sprintf(path, "%s/numa_node", devpath); + fd = fopen(path, "r"); + if (fd) { + fscanf(fd, "%d", &numa_node); + fclose(fd); + } + } + + if (pol->numa_node_set == 1) + new->numa_node = get_numa_node(pol->numa_node); + else + new->numa_node = get_numa_node(numa_node); + + sprintf(path, "%s/local_cpus", devpath); + fd = fopen(path, "r"); + if (!fd) { + cpus_setall(new->cpumask); + goto out; + } + lcpu_mask = NULL; + ret = getline(&lcpu_mask, &blen, fd); + fclose(fd); + if (ret <= 0) { + cpus_setall(new->cpumask); + } else { + cpumask_parse_user(lcpu_mask, ret, new->cpumask); + } + free(lcpu_mask); + +out: + log(TO_CONSOLE, LOG_INFO, "Adding IRQ %d to database\n", irq); + return new; +} + +static void parse_user_policy_key(char *buf, int irq, struct 
user_irq_policy *pol) +{ + char *key, *value, *end; + char *levelvals[] = { "none", "package", "cache", "core" }; + int idx; + int key_set = 1; + + key = buf; + value = strchr(buf, '='); + + if (!value) { + log(TO_SYSLOG, LOG_WARNING, "Bad format for policy, ignoring: %s\n", buf); + return; + } + + /* NULL terminate the key and advance value to the start of the value + * string + */ + *value = '\0'; + value++; + end = strchr(value, '\n'); + if (end) + *end = '\0'; + + if (!strcasecmp("ban", key)) { + if (!strcasecmp("false", value)) + pol->ban = 0; + else if (!strcasecmp("true", value)) + pol->ban = 1; + else { + key_set = 0; + log(TO_ALL, LOG_WARNING, "Unknown value for ban policy: %s\n", value); + } + } else if (!strcasecmp("balance_level", key)) { + for (idx=0; idx<4; idx++) { + if (!strcasecmp(levelvals[idx], value)) + break; + } + + if (idx>3) { + key_set = 0; + log(TO_ALL, LOG_WARNING, "Bad value for balance_level policy: %s\n", value); + } else + pol->level = idx; + } else if (!strcasecmp("numa_node", key)) { + idx = strtoul(value, NULL, 10); + if (!get_numa_node(idx)) { + log(TO_ALL, LOG_WARNING, "NUMA node %d doesn't exist\n", + idx); + return; + } + pol->numa_node = idx; + pol->numa_node_set = 1; + } else { + key_set = 0; + log(TO_ALL, LOG_WARNING, "Unknown key returned, ignoring: %s\n", key); + } + + if (key_set) + log(TO_ALL, LOG_INFO, "IRQ %d: Override %s to %s\n", irq, key, value); + + +} + +/* + * Calls out to a possibly user defined script to get user assigned policy + * aspects for a given irq. 
A value of -1 in a given field indicates no + * policy was given and that system defaults should be used + */ +static void get_irq_user_policy(char *path, int irq, struct user_irq_policy *pol) +{ + char *cmd; + FILE *output; + char buffer[128]; + char *brc; + + memset(pol, -1, sizeof(struct user_irq_policy)); + + /* Return defaults if no script was given */ + if (!polscript) + return; + + /* Use SYSFS_DIR for irq has no sysfs entries */ + if (!path) + path = SYSFS_DIR; + + cmd = alloca(strlen(path)+strlen(polscript)+64); + if (!cmd) + return; + + sprintf(cmd, "exec %s %s %d", polscript, path, irq); + output = popen(cmd, "r"); + if (!output) { + log(TO_ALL, LOG_WARNING, "Unable to execute user policy script %s\n", polscript); + return; + } + + while(!feof(output)) { + brc = fgets(buffer, 128, output); + if (brc) + parse_user_policy_key(brc, irq, pol); + } + pclose(output); +} + +static int check_for_module_ban(char *name) +{ + GList *entry; + + entry = g_list_find_custom(cl_banned_modules, name, substr_find); + + if (entry) + return 1; + else + return 0; +} + +static int check_for_irq_ban(char *path __attribute__((unused)), int irq, GList *proc_interrupts) +{ + struct irq_info find, *res; + GList *entry; + + /* + * Check to see if we banned this irq on the command line + */ + find.irq = irq; + entry = g_list_find_custom(cl_banned_irqs, &find, compare_ints); + if (entry) + return 1; + + /* + * Check to see if we banned module which the irq belongs to. 
+ */ + entry = g_list_find_custom(proc_interrupts, &find, compare_ints); + if (entry) { + res = entry->data; + if (check_for_module_ban(res->name)) + return 1; + } + +#ifdef INCLUDE_BANSCRIPT + char *cmd; + int rc; + + if (!banscript) + return 0; + + if (!path) + return 0; + + cmd = alloca(strlen(path)+strlen(banscript)+32); + if (!cmd) + return 0; + + sprintf(cmd, "%s %s %d > /dev/null",banscript, path, irq); + rc = system(cmd); + + /* + * The system command itself failed + */ + if (rc == -1) { + log(TO_ALL, LOG_WARNING, "%s failed, please check the --banscript option\n", cmd); + return 0; + } + + if (WEXITSTATUS(rc)) { + log(TO_ALL, LOG_INFO, "irq %d is banned by %s\n", irq, banscript); + return 1; + } +#endif + return 0; +} + +/* + * Figures out which interrupt(s) relate to the device we're looking at in dirname + */ +static void build_one_dev_entry(const char *dirname, GList *tmp_irqs) +{ + struct dirent *entry; + DIR *msidir; + FILE *fd; + int irqnum; + struct irq_info *new; + char path[PATH_MAX]; + char devpath[PATH_MAX]; + struct user_irq_policy pol; + + sprintf(path, "%s/%s/msi_irqs", SYSDEV_DIR, dirname); + sprintf(devpath, "%s/%s", SYSDEV_DIR, dirname); + + msidir = opendir(path); + + if (msidir) { + do { + entry = readdir(msidir); + if (!entry) + break; + irqnum = strtol(entry->d_name, NULL, 10); + if (irqnum) { + new = get_irq_info(irqnum); + if (new) + continue; + get_irq_user_policy(devpath, irqnum, &pol); + if ((pol.ban == 1) || (check_for_irq_ban(devpath, irqnum, tmp_irqs))) { + add_banned_irq(irqnum, &banned_irqs); + continue; + } + new = add_one_irq_to_db(devpath, irqnum, &pol); + if (!new) + continue; + new->type = IRQ_TYPE_MSIX; + } + } while (entry != NULL); + closedir(msidir); + return; + } + + sprintf(path, "%s/%s/irq", SYSDEV_DIR, dirname); + fd = fopen(path, "r"); + if (!fd) + return; + if (fscanf(fd, "%d", &irqnum) < 0) + goto done; + + /* + * no pci device has irq 0 + * irq 255 is invalid on x86/x64 architectures + */ +#if
defined(__i386__) || defined(__x86_64__) + if (irqnum && irqnum != 255) { +#else + if (irqnum) { +#endif + new = get_irq_info(irqnum); + if (new) + goto done; + get_irq_user_policy(devpath, irqnum, &pol); + if ((pol.ban == 1) || (check_for_irq_ban(path, irqnum, tmp_irqs))) { + add_banned_irq(irqnum, &banned_irqs); + goto done; + } + + new = add_one_irq_to_db(devpath, irqnum, &pol); + if (!new) + goto done; + new->type = IRQ_TYPE_LEGACY; + } + +done: + fclose(fd); + return; +} + +static void free_irq(struct irq_info *info, void *data __attribute__((unused))) +{ + free(info); +} + +void free_irq_db(void) +{ + for_each_irq(NULL, free_irq, NULL); + g_list_free(interrupts_db); + interrupts_db = NULL; + for_each_irq(banned_irqs, free_irq, NULL); + g_list_free(banned_irqs); + banned_irqs = NULL; + g_list_free(rebalance_irq_list); + rebalance_irq_list = NULL; +} + +void free_cl_opts(void) +{ + g_list_free_full(cl_banned_modules, free); + g_list_free_full(cl_banned_irqs, free); + g_list_free(banned_irqs); +} + +static void add_new_irq(int irq, struct irq_info *hint, GList *proc_interrupts) +{ + struct irq_info *new; + struct user_irq_policy pol; + + new = get_irq_info(irq); + if (new) + return; + + /* Set NULL devpath for the irq has no sysfs entries */ + get_irq_user_policy(NULL, irq, &pol); + if ((pol.ban == 1) || check_for_irq_ban(NULL, irq, proc_interrupts)) { /*FIXME*/ + add_banned_irq(irq, &banned_irqs); + new = get_irq_info(irq); + } else + new = add_one_irq_to_db(NULL, irq, &pol); + + if (!new) { + log(TO_CONSOLE, LOG_WARNING, "add_new_irq: Failed to add irq %d\n", irq); + return; + } + + /* + * Override some of the new irq defaults here + */ + if (hint) { + new->type = hint->type; + new->class = hint->class; + } + + new->level = map_class_to_level[new->class]; +} + +static void add_missing_irq(struct irq_info *info, void *attr) +{ + struct irq_info *lookup = get_irq_info(info->irq); + GList *proc_interrupts = (GList *) attr; + + if (!lookup) + 
add_new_irq(info->irq, info, proc_interrupts); +} + + +void rebuild_irq_db(void) +{ + DIR *devdir; + struct dirent *entry; + GList *tmp_irqs = NULL; + + free_irq_db(); + + tmp_irqs = collect_full_irq_list(); + + devdir = opendir(SYSDEV_DIR); + if (!devdir) + goto free; + + do { + entry = readdir(devdir); + + if (!entry) + break; + + build_one_dev_entry(entry->d_name, tmp_irqs); + + } while (entry != NULL); + + closedir(devdir); + + + for_each_irq(tmp_irqs, add_missing_irq, interrupts_db); + +free: + g_list_free_full(tmp_irqs, free); + +} + +void for_each_irq(GList *list, void (*cb)(struct irq_info *info, void *data), void *data) +{ + GList *entry = g_list_first(list ? list : interrupts_db); + GList *next; + + while (entry) { + next = g_list_next(entry); + cb(entry->data, data); + entry = next; + } +} + +struct irq_info *get_irq_info(int irq) +{ + GList *entry; + struct irq_info find; + + find.irq = irq; + entry = g_list_find_custom(interrupts_db, &find, compare_ints); + + if (!entry) + entry = g_list_find_custom(banned_irqs, &find, compare_ints); + + return entry ? 
entry->data : NULL; +} + +void migrate_irq(GList **from, GList **to, struct irq_info *info) +{ + GList *entry; + struct irq_info find, *tmp; + + find.irq = info->irq; + entry = g_list_find_custom(*from, &find, compare_ints); + + if (!entry) + return; + + tmp = entry->data; + *from = g_list_delete_link(*from, entry); + + + *to = g_list_append(*to, tmp); + info->moved = 1; +} + +static gint sort_irqs(gconstpointer A, gconstpointer B) +{ + struct irq_info *a, *b; + + a = (struct irq_info*)A; + b = (struct irq_info*)B; + + if (a->class < b->class) + return 1; + if (a->class > b->class) + return -1; + if (a->load < b->load) + return 1; + if (a->load > b->load) + return -1; + if (a < b) + return 1; + return -1; +} + +void sort_irq_list(GList **list) +{ + *list = g_list_sort(*list, sort_irqs); +} diff --git a/configure.ac b/configure.ac new file mode 100644 index 0000000..f6c60da --- /dev/null +++ b/configure.ac @@ -0,0 +1,91 @@ +AC_INIT(irqbalance,1.3.0) +AC_PREREQ(2.12)dnl +AM_CONFIG_HEADER(config.h) + +AC_CONFIG_MACRO_DIR([m4]) +AM_INIT_AUTOMAKE([foreign] [subdir-objects]) +AM_PROG_LIBTOOL +AC_SUBST(LIBTOOL_DEPS) + +AC_PROG_CC +AC_PROG_INSTALL +AC_PROG_AWK + +AC_ARG_ENABLE([numa], + AS_HELP_STRING([--disable-numa], [enable numa support (default is auto)])) +AS_IF([test "$enable_numa" = "no"],[ + ac_cv_header_numa_h=no + ac_cv_lib_numa_numa_available=no +]) + +AC_HEADER_STDC +AC_CHECK_HEADERS([numa.h]) + +AC_CHECK_FUNCS(getopt_long) + +AC_CHECK_LIB(numa, numa_available) +AC_CHECK_LIB(m, floor) + +PKG_CHECK_MODULES([GLIB2], [glib-2.0], [], [AC_MSG_ERROR([glib-2.0 is required])]) + +PKG_CHECK_MODULES([NCURSESW], [ncursesw], [has_ncursesw=yes], [AC_CHECK_LIB(curses, mvprintw)]) +AS_IF([test "x$has_ncursesw" = "xyes"], [ + AC_SUBST([NCURSESW_CFLAGS]) + AC_SUBST([NCURSESW_LIBS]) + LIBS="$LIBS $NCURSESW_LIBS" + AC_SUBST([LIBS]) +]) + +AC_C_CONST +AC_C_INLINE +AM_PROG_CC_C_O + +AC_ARG_WITH([irqbalance-ui], + [AC_HELP_STRING([--without-irqbalance-ui], + [Dont build the 
irqbalance ui component])], + [with_irqbalanceui=false], [with_irqbalanceui=true]) + +AM_CONDITIONAL([IRQBALANCEUI], [test x$with_irqbalanceui = xtrue]) + +AC_ARG_WITH([systemd], + [ AS_HELP_STRING([--with-systemd],[Add systemd-lib support])] +) +AS_IF( + [test "x$with_systemd" = xyes], [ + PKG_CHECK_MODULES([SYSTEMD], [libsystemd], [journal_lib=yes], [journal_lib=no]) + AS_IF([test "x$journal_lib" != "xyes"], [ + PKG_CHECK_MODULES([SYSTEMD], [libsystemd-journal], [journal_lib=yes]) + ]) + AC_DEFINE(HAVE_LIBSYSTEMD, 1, [systemd support]) + AC_CHECK_LIB([systemd], [sd_journal_print_with_location]) + AC_CHECK_LIB([systemd], [sd_journal_print]) +]) + +AC_ARG_WITH([libcap-ng], + AS_HELP_STRING([libcap-ng], [Add libcap-ng-support @<:@default=auto@:>@])) + +AS_IF( + [test "x$libcap_ng" != "xno"], + [ + PKG_CHECK_MODULES([LIBCAP_NG], [libcap-ng], + [AC_DEFINE(HAVE_LIBCAP_NG,1,[libcap-ng support])], + [ + AS_IF( + [test "x$libcap_ng" = "xyes"], + [ + AC_MSG_ERROR([libcap-ng not found]) + ] + ) + ] + ) + ] +) + +AC_OUTPUT(Makefile tests/Makefile) + +AC_MSG_NOTICE() +AC_MSG_NOTICE([irqbalance Version: $VERSION]) +AC_MSG_NOTICE([Target: $target]) +AC_MSG_NOTICE([Installation prefix: $prefix]) +AC_MSG_NOTICE([Compiler: $CC]) +AC_MSG_NOTICE([Compiler flags: $CFLAGS]) diff --git a/constants.h b/constants.h new file mode 100644 index 0000000..8e34339 --- /dev/null +++ b/constants.h @@ -0,0 +1,33 @@ +#ifndef __INCLUDE_GUARD_CONSTANTS_H +#define __INCLUDE_GUARD_CONSTANTS_H + +/* interval between rebalance attempts in seconds */ +#define SLEEP_INTERVAL 10 + +#define NSEC_PER_SEC 1e9 + +/* NUMA topology refresh intervals, in units of SLEEP_INTERVAL */ +#define NUMA_REFRESH_INTERVAL 32 +/* NIC interrupt refresh interval, in units of SLEEP_INTERVAL */ +#define NIC_REFRESH_INTERVAL 32 + +/* minimum number of interrupts since boot for an interrupt to matter */ +#define MIN_IRQ_COUNT 20 + + +/* balancing tunings */ + +#define CROSS_PACKAGE_PENALTY 3000 +#define NUMA_PENALTY 500 +#define 
POWER_MODE_PACKAGE_THRESHOLD 20000 +#define CLASS_VIOLATION_PENTALTY 6000 +#define MSI_CACHE_PENALTY 10000 +#define CORE_SPECIFIC_THRESHOLD 5000 + +/* power mode */ + +#define POWER_MODE_SOFTIRQ_THRESHOLD 20 +#define POWER_MODE_HYSTERESIS 3 + + +#endif diff --git a/cpumask.h b/cpumask.h new file mode 100644 index 0000000..0774a88 --- /dev/null +++ b/cpumask.h @@ -0,0 +1,400 @@ +#ifndef __LINUX_CPUMASK_H +#define __LINUX_CPUMASK_H + +#define NR_CPUS 4096 +/* + * Cpumasks provide a bitmap suitable for representing the + * set of CPU's in a system, one bit position per CPU number. + * + * See detailed comments in the file linux/bitmap.h describing the + * data type on which these cpumasks are based. + * + * For details of cpumask_scnprintf() and cpumask_parse_user(), + * see bitmap_scnprintf() and bitmap_parse_user() in lib/bitmap.c. + * For details of cpulist_scnprintf() and cpulist_parse(), see + * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c. + * For details of cpu_remap(), see bitmap_bitremap in lib/bitmap.c + * For details of cpus_remap(), see bitmap_remap in lib/bitmap.c. + * + * The available cpumask operations are: + * + * void cpu_set(cpu, mask) turn on bit 'cpu' in mask + * void cpu_clear(cpu, mask) turn off bit 'cpu' in mask + * void cpus_setall(mask) set all bits + * void cpus_clear(mask) clear all bits + * int cpu_isset(cpu, mask) true iff bit 'cpu' set in mask + * int cpu_test_and_set(cpu, mask) test and set bit 'cpu' in mask + * + * void cpus_and(dst, src1, src2) dst = src1 & src2 [intersection] + * void cpus_or(dst, src1, src2) dst = src1 | src2 [union] + * void cpus_xor(dst, src1, src2) dst = src1 ^ src2 + * void cpus_andnot(dst, src1, src2) dst = src1 & ~src2 + * void cpus_complement(dst, src) dst = ~src + * + * int cpus_equal(mask1, mask2) Does mask1 == mask2? + * int cpus_intersects(mask1, mask2) Do mask1 and mask2 intersect? + * int cpus_subset(mask1, mask2) Is mask1 a subset of mask2? 
+ * int cpus_empty(mask) Is mask empty (no bits set)? + * int cpus_full(mask) Is mask full (all bits set)? + * int cpus_weight(mask) Hamming weight - number of set bits + * + * void cpus_shift_right(dst, src, n) Shift right + * void cpus_shift_left(dst, src, n) Shift left + * + * int first_cpu(mask) Number lowest set bit, or NR_CPUS + * int next_cpu(cpu, mask) Next cpu past 'cpu', or NR_CPUS + * + * cpumask_t cpumask_of_cpu(cpu) Return cpumask with bit 'cpu' set + * CPU_MASK_ALL Initializer - all bits set + * CPU_MASK_NONE Initializer - no bits set + * unsigned long *cpus_addr(mask) Array of unsigned long's in mask + * + * int cpumask_scnprintf(buf, len, mask) Format cpumask for printing + * int cpumask_parse_user(ubuf, ulen, mask) Parse ascii string as cpumask + * int cpulist_scnprintf(buf, len, mask) Format cpumask as list for printing + * int cpulist_parse(buf, len, map) Parse ascii string as cpulist + * int cpu_remap(oldbit, old, new) newbit = map(old, new)(oldbit) + * int cpus_remap(dst, src, old, new) *dst = map(old, new)(src) + * + * for_each_cpu_mask(cpu, mask) for-loop cpu over mask + * + * int num_online_cpus() Number of online CPUs + * int num_possible_cpus() Number of all possible CPUs + * int num_present_cpus() Number of present CPUs + * + * int cpu_online(cpu) Is some cpu online? + * int cpu_possible(cpu) Is some cpu possible? + * int cpu_present(cpu) Is some cpu present (can schedule)? + * + * int any_online_cpu(mask) First online cpu in mask + * + * for_each_possible_cpu(cpu) for-loop cpu over cpu_possible_map + * for_each_online_cpu(cpu) for-loop cpu over cpu_online_map + * for_each_present_cpu(cpu) for-loop cpu over cpu_present_map + * + * Subtlety: + * 1) The 'type-checked' form of cpu_isset() causes gcc (3.3.2, anyway) + * to generate slightly worse code.
Note for example the additional + * 40 lines of assembly code compiling the "for each possible cpu" + * loops buried in the disk_stat_read() macros calls when compiling + * drivers/block/genhd.c (arch i386, CONFIG_SMP=y). So use a simple + * one-line #define for cpu_isset(), instead of wrapping an inline + * inside a macro, the way we do the other calls. + */ + +#include "bitmap.h" + +typedef struct { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t; +extern cpumask_t _unused_cpumask_arg_; + +#define cpu_set(cpu, dst) __cpu_set((cpu), &(dst)) +static inline void __cpu_set(int cpu, volatile cpumask_t *dstp) +{ + set_bit(cpu, dstp->bits); +} + +#define cpu_clear(cpu, dst) __cpu_clear((cpu), &(dst)) +static inline void __cpu_clear(int cpu, volatile cpumask_t *dstp) +{ + clear_bit(cpu, dstp->bits); +} + +#define cpus_setall(dst) __cpus_setall(&(dst), NR_CPUS) +static inline void __cpus_setall(cpumask_t *dstp, int nbits) +{ + bitmap_fill(dstp->bits, nbits); +} + +#define cpus_clear(dst) __cpus_clear(&(dst), NR_CPUS) +static inline void __cpus_clear(cpumask_t *dstp, int nbits) +{ + bitmap_zero(dstp->bits, nbits); +} + +/* No static inline type checking - see Subtlety (1) above. 
*/ +#define cpu_isset(cpu, cpumask) test_bit((cpu), (cpumask).bits) + +#define cpus_and(dst, src1, src2) __cpus_and(&(dst), &(src1), &(src2), NR_CPUS) +static inline void __cpus_and(cpumask_t *dstp, const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + bitmap_and(dstp->bits, src1p->bits, src2p->bits, nbits); +} + +#define cpus_or(dst, src1, src2) __cpus_or(&(dst), &(src1), &(src2), NR_CPUS) +static inline void __cpus_or(cpumask_t *dstp, const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + bitmap_or(dstp->bits, src1p->bits, src2p->bits, nbits); +} + +#define cpus_xor(dst, src1, src2) __cpus_xor(&(dst), &(src1), &(src2), NR_CPUS) +static inline void __cpus_xor(cpumask_t *dstp, const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + bitmap_xor(dstp->bits, src1p->bits, src2p->bits, nbits); +} + +#define cpus_andnot(dst, src1, src2) \ + __cpus_andnot(&(dst), &(src1), &(src2), NR_CPUS) +static inline void __cpus_andnot(cpumask_t *dstp, const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + bitmap_andnot(dstp->bits, src1p->bits, src2p->bits, nbits); +} + +#define cpus_complement(dst, src) __cpus_complement(&(dst), &(src), NR_CPUS) +static inline void __cpus_complement(cpumask_t *dstp, + const cpumask_t *srcp, int nbits) +{ + bitmap_complement(dstp->bits, srcp->bits, nbits); +} + +#define cpus_equal(src1, src2) __cpus_equal(&(src1), &(src2), NR_CPUS) +static inline int __cpus_equal(const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + return bitmap_equal(src1p->bits, src2p->bits, nbits); +} + +#define cpus_intersects(src1, src2) __cpus_intersects(&(src1), &(src2), NR_CPUS) +static inline int __cpus_intersects(const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + return bitmap_intersects(src1p->bits, src2p->bits, nbits); +} + +#define cpus_subset(src1, src2) __cpus_subset(&(src1), &(src2), NR_CPUS) +static inline int __cpus_subset(const cpumask_t *src1p, + const cpumask_t *src2p, int nbits) +{ + return 
bitmap_subset(src1p->bits, src2p->bits, nbits); +} + +#define cpus_empty(src) __cpus_empty(&(src), NR_CPUS) +static inline int __cpus_empty(const cpumask_t *srcp, int nbits) +{ + return bitmap_empty(srcp->bits, nbits); +} + +#define cpus_full(cpumask) __cpus_full(&(cpumask), NR_CPUS) +static inline int __cpus_full(const cpumask_t *srcp, int nbits) +{ + return bitmap_full(srcp->bits, nbits); +} + +#define cpus_weight(cpumask) __cpus_weight(&(cpumask), NR_CPUS) +static inline int __cpus_weight(const cpumask_t *srcp, int nbits) +{ + return bitmap_weight(srcp->bits, nbits); +} + +#define cpus_shift_right(dst, src, n) \ + __cpus_shift_right(&(dst), &(src), (n), NR_CPUS) +static inline void __cpus_shift_right(cpumask_t *dstp, + const cpumask_t *srcp, int n, int nbits) +{ + bitmap_shift_right(dstp->bits, srcp->bits, n, nbits); +} + +#define cpus_shift_left(dst, src, n) \ + __cpus_shift_left(&(dst), &(src), (n), NR_CPUS) +static inline void __cpus_shift_left(cpumask_t *dstp, + const cpumask_t *srcp, int n, int nbits) +{ + bitmap_shift_left(dstp->bits, srcp->bits, n, nbits); +} + +static inline int __first_cpu(const cpumask_t *srcp) +{ + return ffs(*srcp->bits)-1; +} + +#define first_cpu(src) __first_cpu(&(src)) +int __next_cpu(int n, const cpumask_t *srcp); +#define next_cpu(n, src) __next_cpu((n), &(src)) + +#define cpumask_of_cpu(cpu) \ +({ \ + typeof(_unused_cpumask_arg_) m; \ + if (sizeof(m) == sizeof(unsigned long)) { \ + m.bits[0] = 1UL<<(cpu); \ + } else { \ + cpus_clear(m); \ + cpu_set((cpu), m); \ + } \ + m; \ +}) + +#define CPU_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(NR_CPUS) + +#if 0 + +#define CPU_MASK_ALL \ +(cpumask_t) { { \ + [BITS_TO_LONGS(NR_CPUS)-1] = CPU_MASK_LAST_WORD \ +} } + +#else + +#define CPU_MASK_ALL \ +(cpumask_t) { { \ + [0 ... BITS_TO_LONGS(NR_CPUS)-2] = ~0UL, \ + [BITS_TO_LONGS(NR_CPUS)-1] = CPU_MASK_LAST_WORD \ +} } + +#endif + +#define CPU_MASK_NONE \ +(cpumask_t) { { \ + [0 ... 
BITS_TO_LONGS(NR_CPUS)-1] = 0UL \ +} } + +#define CPU_MASK_CPU0 \ +(cpumask_t) { { \ + [0] = 1UL \ +} } + +#define cpus_addr(src) ((src).bits) + +#define cpumask_scnprintf(buf, len, src) \ + __cpumask_scnprintf((buf), (len), &(src), NR_CPUS) +static inline int __cpumask_scnprintf(char *buf, int len, + const cpumask_t *srcp, int nbits) +{ + return bitmap_scnprintf(buf, len, srcp->bits, nbits); +} + +#define cpumask_parse_user(ubuf, ulen, dst) \ + __cpumask_parse_user((ubuf), (ulen), &(dst), NR_CPUS) +static inline int __cpumask_parse_user(const char *buf, int len, + cpumask_t *dstp, int nbits) +{ + return bitmap_parse(buf, len, dstp->bits, nbits); +} + +#define cpulist_scnprintf(buf, len, src) \ + __cpulist_scnprintf((buf), (len), &(src), NR_CPUS) +static inline int __cpulist_scnprintf(char *buf, int len, + const cpumask_t *srcp, int nbits) +{ + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); +} + +#define cpulist_parse(buf, len, dst) __cpulist_parse((buf), (len), &(dst), NR_CPUS) +static inline int __cpulist_parse(const char *buf, int len, cpumask_t *dstp, int nbits) +{ + return bitmap_parselist(buf, len, dstp->bits, nbits); +} + +#define cpu_remap(oldbit, old, new) \ + __cpu_remap((oldbit), &(old), &(new), NR_CPUS) +static inline int __cpu_remap(int oldbit, + const cpumask_t *oldp, const cpumask_t *newp, int nbits) +{ + return bitmap_bitremap(oldbit, oldp->bits, newp->bits, nbits); +} + +#define cpus_remap(dst, src, old, new) \ + __cpus_remap(&(dst), &(src), &(old), &(new), NR_CPUS) +static inline void __cpus_remap(cpumask_t *dstp, const cpumask_t *srcp, + const cpumask_t *oldp, const cpumask_t *newp, int nbits) +{ + bitmap_remap(dstp->bits, srcp->bits, oldp->bits, newp->bits, nbits); +} + +#if NR_CPUS > 1 +#define for_each_cpu_mask(cpu, mask) \ + for ((cpu) = first_cpu(mask); \ + (cpu) < NR_CPUS; \ + (cpu) = next_cpu((cpu), (mask))) +#else /* NR_CPUS == 1 */ +#define for_each_cpu_mask(cpu, mask) \ + for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask) 
+#endif /* NR_CPUS */ + +/* + * The following particular system cpumasks and operations manage + * possible, present and online cpus. Each of them is a fixed size + * bitmap of size NR_CPUS. + * + * #ifdef CONFIG_HOTPLUG_CPU + * cpu_possible_map - has bit 'cpu' set iff cpu is populatable + * cpu_present_map - has bit 'cpu' set iff cpu is populated + * cpu_online_map - has bit 'cpu' set iff cpu available to scheduler + * #else + * cpu_possible_map - has bit 'cpu' set iff cpu is populated + * cpu_present_map - copy of cpu_possible_map + * cpu_online_map - has bit 'cpu' set iff cpu available to scheduler + * #endif + * + * In either case, NR_CPUS is fixed at compile time, as the static + * size of these bitmaps. The cpu_possible_map is fixed at boot + * time, as the set of CPU id's that it is possible might ever + * be plugged in at any time during the life of that system boot. + * The cpu_present_map is dynamic(*), representing which CPUs + * are currently plugged in. And cpu_online_map is the dynamic + * subset of cpu_present_map, indicating those CPUs available + * for scheduling. + * + * If HOTPLUG is enabled, then cpu_possible_map is forced to have + * all NR_CPUS bits set, otherwise it is just the set of CPUs that + * ACPI reports present at boot. + * + * If HOTPLUG is enabled, then cpu_present_map varies dynamically, + * depending on what ACPI reports as currently plugged in, otherwise + * cpu_present_map is just a copy of cpu_possible_map. + * + * (*) Well, cpu_present_map is dynamic in the hotplug case. If not + * hotplug, it's a copy of cpu_possible_map, hence fixed at boot. + * + * Subtleties: + * 1) UP arch's (NR_CPUS == 1, CONFIG_SMP not defined) hardcode + * assumption that their single CPU is online. The UP + * cpu_{online,possible,present}_maps are placebos. Changing them + * will have no useful effect on the following num_*_cpus() + * and cpu_*() macros in the UP case.
This ugliness is a UP + * optimization - don't waste any instructions or memory references + * asking if you're online or how many CPUs there are if there is + * only one CPU. + * 2) Most SMP arch's #define some of these maps to be some + * other map specific to that arch. Therefore, the following + * must be #define macros, not inlines. To see why, examine + * the assembly code produced by the following. Note that + * set1() writes phys_x_map, but set2() writes x_map: + * int x_map, phys_x_map; + * #define set1(a) x_map = a + * inline void set2(int a) { x_map = a; } + * #define x_map phys_x_map + * main(){ set1(3); set2(5); } + */ + +extern cpumask_t cpu_possible_map; +extern cpumask_t cpu_online_map; +extern cpumask_t cpu_present_map; + +#if NR_CPUS > 1 +#define num_online_cpus() cpus_weight(cpu_online_map) +#define num_possible_cpus() cpus_weight(cpu_possible_map) +#define num_present_cpus() cpus_weight(cpu_present_map) +#define cpu_online(cpu) cpu_isset((cpu), cpu_online_map) +#define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map) +#define cpu_present(cpu) cpu_isset((cpu), cpu_present_map) +#else +#define num_online_cpus() 1 +#define num_possible_cpus() 1 +#define num_present_cpus() 1 +#define cpu_online(cpu) ((cpu) == 0) +#define cpu_possible(cpu) ((cpu) == 0) +#define cpu_present(cpu) ((cpu) == 0) +#endif + +int highest_possible_processor_id(void); +#define any_online_cpu(mask) __any_online_cpu(&(mask)) +int __any_online_cpu(const cpumask_t *mask); + +#define for_each_possible_cpu(cpu) for_each_cpu_mask((cpu), cpu_possible_map) +#define for_each_online_cpu(cpu) for_each_cpu_mask((cpu), cpu_online_map) +#define for_each_present_cpu(cpu) for_each_cpu_mask((cpu), cpu_present_map) + +#endif /* __LINUX_CPUMASK_H */ diff --git a/cputree.c b/cputree.c new file mode 100644 index 0000000..d09af43 --- /dev/null +++ b/cputree.c @@ -0,0 +1,589 @@ +/* + * Copyright (C) 2006, Intel Corporation + * Copyright (C) 2012, Neil Horman + * + * This file is part of 
irqbalance + * + * This program file is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; version 2 of the License. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License + * for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program in a file named COPYING; if not, write to the + * Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, + * Boston, MA 02110-1301 USA + */ + +/* + * This file contains the code to construct and manipulate a hierarchy of processors, + * cache domains and processor cores. + */ + +#include "config.h" +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "irqbalance.h" + +extern char *banned_cpumask_from_ui; + +GList *cpus; +GList *cache_domains; +GList *packages; + +int package_count; +int cache_domain_count; +int core_count; + +/* Users want to be able to keep interrupts away from some cpus; store these in a cpumask_t */ +cpumask_t banned_cpus; + +cpumask_t cpu_possible_map; + +/* + it's convenient to have the complement of banned_cpus available so that + the AND operator can be used to mask out unwanted cpus +*/ +cpumask_t unbanned_cpus; + +/* + * By default do not place IRQs on CPUs the kernel keeps isolated or + * nohz_full, as specified through the boot commandline. Users can + * override this with the IRQBALANCE_BANNED_CPUS environment variable. + */ +static void setup_banned_cpus(void) +{ + FILE *file; + char *line = NULL; + size_t size = 0; + char buffer[4096]; + cpumask_t nohz_full; + cpumask_t isolated_cpus; + + cpus_clear(isolated_cpus); + cpus_clear(nohz_full); + + /* A manually specified cpumask overrides auto-detection. 
*/ + if (banned_cpumask_from_ui != NULL) { + cpulist_parse(banned_cpumask_from_ui, + strlen(banned_cpumask_from_ui), banned_cpus); + goto out; + } + if (getenv("IRQBALANCE_BANNED_CPUS")) { + cpumask_parse_user(getenv("IRQBALANCE_BANNED_CPUS"), strlen(getenv("IRQBALANCE_BANNED_CPUS")), banned_cpus); + goto out; + } + file = fopen("/sys/devices/system/cpu/isolated", "r"); + if (file) { + if (getline(&line, &size, file) > 0) { + if (strlen(line) && line[0] != '\n') + cpulist_parse(line, strlen(line), isolated_cpus); + free(line); + line = NULL; + size = 0; + } + fclose(file); + } + + file = fopen("/sys/devices/system/cpu/nohz_full", "r"); + if (file) { + if (getline(&line, &size, file) > 0) { + if (strlen(line) && line[0] != '\n') + cpulist_parse(line, strlen(line), nohz_full); + free(line); + line = NULL; + size = 0; + } + fclose(file); + } + + cpus_or(banned_cpus, nohz_full, isolated_cpus); + +out: + cpumask_scnprintf(buffer, 4096, isolated_cpus); + log(TO_CONSOLE, LOG_INFO, "Isolated CPUs: %s\n", buffer); + cpumask_scnprintf(buffer, 4096, nohz_full); + log(TO_CONSOLE, LOG_INFO, "Adaptive-ticks CPUs: %s\n", buffer); + cpumask_scnprintf(buffer, 4096, banned_cpus); + log(TO_CONSOLE, LOG_INFO, "Banned CPUs: %s\n", buffer); +} + +static void add_numa_node_to_topo_obj(struct topo_obj *obj, int nodeid) +{ + GList *entry; + struct topo_obj *node; + struct topo_obj *cand_node; + + node = get_numa_node(nodeid); + if (!node || node->number == -1) + return; + + entry = g_list_first(obj->numa_nodes); + while (entry) { + cand_node = entry->data; + if (cand_node == node) + break; + entry = g_list_next(entry); + } + + if (!entry) + obj->numa_nodes = g_list_append(obj->numa_nodes, node); +} + +static struct topo_obj* add_cache_domain_to_package(struct topo_obj *cache, + int packageid, + cpumask_t package_mask, + int nodeid) +{ + GList *entry; + struct topo_obj *package; + struct topo_obj *lcache; + + entry = g_list_first(packages); + + while (entry) { + package = entry->data; + if 
(cpus_equal(package_mask, package->mask)) { + if (packageid != package->number) + log(TO_ALL, LOG_WARNING, "package_mask with different physical_package_id found!\n"); + break; + } + entry = g_list_next(entry); + } + + if (!entry) { + package = calloc(sizeof(struct topo_obj), 1); + if (!package) + return NULL; + package->mask = package_mask; + package->obj_type = OBJ_TYPE_PACKAGE; + package->obj_type_list = &packages; + package->number = packageid; + packages = g_list_append(packages, package); + package_count++; + } + + entry = g_list_first(package->children); + while (entry) { + lcache = entry->data; + if (lcache == cache) + break; + entry = g_list_next(entry); + } + + if (!entry) { + package->children = g_list_append(package->children, cache); + cache->parent = package; + } + + if (nodeid > -1) + add_numa_node_to_topo_obj(package, nodeid); + + return package; +} +static struct topo_obj* add_cpu_to_cache_domain(struct topo_obj *cpu, + cpumask_t cache_mask, + int nodeid) +{ + GList *entry; + struct topo_obj *cache; + struct topo_obj *lcpu; + + entry = g_list_first(cache_domains); + + while (entry) { + cache = entry->data; + if (cpus_equal(cache_mask, cache->mask)) + break; + entry = g_list_next(entry); + } + + if (!entry) { + cache = calloc(sizeof(struct topo_obj), 1); + if (!cache) + return NULL; + cache->obj_type = OBJ_TYPE_CACHE; + cache->mask = cache_mask; + cache->number = cache_domain_count; + cache->obj_type_list = &cache_domains; + cache_domains = g_list_append(cache_domains, cache); + cache_domain_count++; + } + + entry = g_list_first(cache->children); + while (entry) { + lcpu = entry->data; + if (lcpu == cpu) + break; + entry = g_list_next(entry); + } + + if (!entry) { + cache->children = g_list_append(cache->children, cpu); + cpu->parent = (struct topo_obj *)cache; + } + + if (nodeid > -1) + add_numa_node_to_topo_obj(cache, nodeid); + + return cache; +} + +static void do_one_cpu(char *path) +{ + struct topo_obj *cpu; + FILE *file; + char 
new_path[PATH_MAX]; + cpumask_t cache_mask, package_mask; + struct topo_obj *cache; + DIR *dir; + struct dirent *entry; + int nodeid; + int packageid = 0; + unsigned int max_cache_index, cache_index, cache_stat; + + /* skip offline cpus */ + snprintf(new_path, PATH_MAX, "%s/online", path); + file = fopen(new_path, "r"); + if (file) { + char *line = NULL; + size_t size = 0; + if (getline(&line, &size, file)==0) + return; + fclose(file); + if (line && line[0]=='0') { + free(line); + return; + } + free(line); + } + + cpu = calloc(sizeof(struct topo_obj), 1); + if (!cpu) + return; + + cpu->obj_type = OBJ_TYPE_CPU; + + cpu->number = strtoul(&path[27], NULL, 10); + + cpu_set(cpu->number, cpu_possible_map); + + cpu_set(cpu->number, cpu->mask); + + /* + * Default the cache_domain mask to be equal to the cpu + */ + cpus_clear(cache_mask); + cpu_set(cpu->number, cache_mask); + + /* if the cpu is on the banned list, just don't add it */ + if (cpus_intersects(cpu->mask, banned_cpus)) { + free(cpu); + /* even though we don't use the cpu we do need to count it */ + core_count++; + return; + } + + + /* try to read the package mask; if it doesn't exist assume solitary */ + snprintf(new_path, PATH_MAX, "%s/topology/core_siblings", path); + file = fopen(new_path, "r"); + cpu_set(cpu->number, package_mask); + if (file) { + char *line = NULL; + size_t size = 0; + if (getline(&line, &size, file)) + cpumask_parse_user(line, strlen(line), package_mask); + fclose(file); + free(line); + } + /* try to read the package id */ + snprintf(new_path, PATH_MAX, "%s/topology/physical_package_id", path); + file = fopen(new_path, "r"); + if (file) { + char *line = NULL; + size_t size = 0; + if (getline(&line, &size, file)) + packageid = strtoul(line, NULL, 10); + fclose(file); + free(line); + } + + /* try to read the cache mask; if it doesn't exist assume solitary */ + /* We want the deepest cache level available */ + cpu_set(cpu->number, cache_mask); + max_cache_index = 0; + cache_index = 1; + do { 
+ struct stat sb; + snprintf(new_path, PATH_MAX, "%s/cache/index%d/shared_cpu_map", path, cache_index); + cache_stat = stat(new_path, &sb); + if (!cache_stat) { + max_cache_index = cache_index; + if (max_cache_index == deepest_cache) + break; + cache_index ++; + } + } while(!cache_stat); + + if (max_cache_index > 0) { + snprintf(new_path, PATH_MAX, "%s/cache/index%d/shared_cpu_map", path, max_cache_index); + file = fopen(new_path, "r"); + if (file) { + char *line = NULL; + size_t size = 0; + if (getline(&line, &size, file)) + cpumask_parse_user(line, strlen(line), cache_mask); + fclose(file); + free(line); + } + } + + nodeid=-1; + if (numa_avail) { + struct topo_obj *node; + + dir = opendir(path); + do { + entry = readdir(dir); + if (!entry) + break; + if (strstr(entry->d_name, "node")) { + nodeid = strtoul(&entry->d_name[4], NULL, 10); + break; + } + } while (entry); + closedir(dir); + + /* + * In case of multiple NUMA nodes within a CPU package, + * we override package_mask with node mask. 
+ */ + node = get_numa_node(nodeid); + if (node && (cpus_weight(package_mask) > cpus_weight(node->mask))) + cpus_and(package_mask, package_mask, node->mask); + } + + /* + blank out the banned cpus from the various masks so that interrupts + will never be told to go there + */ + cpus_and(cache_mask, cache_mask, unbanned_cpus); + cpus_and(package_mask, package_mask, unbanned_cpus); + + cache = add_cpu_to_cache_domain(cpu, cache_mask, nodeid); + add_cache_domain_to_package(cache, packageid, package_mask, + nodeid); + + cpu->obj_type_list = &cpus; + cpus = g_list_append(cpus, cpu); + core_count++; +} + +static void dump_irq(struct irq_info *info, void *data) +{ + int spaces = (long int)data; + int i; + char * indent = malloc (sizeof(char) * (spaces + 1)); + + for ( i = 0; i < spaces; i++ ) + indent[i] = log_indent[0]; + + indent[i] = '\0'; + log(TO_CONSOLE, LOG_INFO, "%sInterrupt %i node_num is %d (%s/%lu:%lu) \n", indent, + info->irq, irq_numa_node(info)->number, classes[info->class], info->load, (info->irq_count - info->last_irq_count)); + free(indent); +} + +static void dump_numa_node_num(struct topo_obj *p, void *data __attribute__((unused))) +{ + log(TO_CONSOLE, LOG_INFO, "%d ", p->number); +} + +static void dump_balance_obj(struct topo_obj *d, void *data __attribute__((unused))) +{ + struct topo_obj *c = (struct topo_obj *)d; + log(TO_CONSOLE, LOG_INFO, "%s%s%s%sCPU number %i numa_node is ", + log_indent, log_indent, log_indent, log_indent, c->number); + for_each_object(cpu_numa_node(c), dump_numa_node_num, NULL); + log(TO_CONSOLE, LOG_INFO, "(load %lu)\n", (unsigned long)c->load); + if (c->interrupts) + for_each_irq(c->interrupts, dump_irq, (void *)18); +} + +static void dump_cache_domain(struct topo_obj *d, void *data) +{ + char *buffer = data; + cpumask_scnprintf(buffer, 4095, d->mask); + log(TO_CONSOLE, LOG_INFO, "%s%sCache domain %i: numa_node is ", + log_indent, log_indent, d->number); + for_each_object(d->numa_nodes, dump_numa_node_num, NULL); + 
log(TO_CONSOLE, LOG_INFO, "cpu mask is %s (load %lu) \n", buffer, + (unsigned long)d->load); + if (d->children) + for_each_object(d->children, dump_balance_obj, NULL); + if (g_list_length(d->interrupts) > 0) + for_each_irq(d->interrupts, dump_irq, (void *)10); +} + +static void dump_package(struct topo_obj *d, void *data) +{ + char *buffer = data; + cpumask_scnprintf(buffer, 4096, d->mask); + log(TO_CONSOLE, LOG_INFO, "Package %i: numa_node ", d->number); + for_each_object(d->numa_nodes, dump_numa_node_num, NULL); + log(TO_CONSOLE, LOG_INFO, "cpu mask is %s (load %lu)\n", + buffer, (unsigned long)d->load); + if (d->children) + for_each_object(d->children, dump_cache_domain, buffer); + if (g_list_length(d->interrupts) > 0) + for_each_irq(d->interrupts, dump_irq, (void *)2); +} + +void dump_tree(void) +{ + char buffer[4096]; + for_each_object(packages, dump_package, buffer); +} + +static void clear_irq_stats(struct irq_info *info, void *data __attribute__((unused))) +{ + info->load = 0; +} + +static void clear_obj_stats(struct topo_obj *d, void *data __attribute__((unused))) +{ + for_each_object(d->children, clear_obj_stats, NULL); + for_each_irq(d->interrupts, clear_irq_stats, NULL); +} + +/* + * this function removes previous state from the cpu tree, such as + * which level does how much work and the actual lists of interrupts + * assigned to each component + */ +void clear_work_stats(void) +{ + for_each_object(numa_nodes, clear_obj_stats, NULL); +} + + +void parse_cpu_tree(void) +{ + DIR *dir; + struct dirent *entry; + + setup_banned_cpus(); + + cpus_complement(unbanned_cpus, banned_cpus); + + dir = opendir("/sys/devices/system/cpu"); + if (!dir) + return; + do { + int num; + char pad; + entry = readdir(dir); + /* + * We only want to count real cpus, not cpufreq and + * cpuidle + */ + if (entry && + sscanf(entry->d_name, "cpu%d%c", &num, &pad) == 1 && + !strchr(entry->d_name, ' ')) { + char new_path[PATH_MAX]; + sprintf(new_path, "/sys/devices/system/cpu/%s", 
entry->d_name); + do_one_cpu(new_path); + } + } while (entry); + closedir(dir); + for_each_object(packages, connect_cpu_mem_topo, NULL); + + if (debug_mode) + dump_tree(); + +} + + +/* + * This function frees all memory related to a cpu tree so that a new tree + * can be read + */ +void clear_cpu_tree(void) +{ + GList *item; + struct topo_obj *cpu; + struct topo_obj *cache_domain; + struct topo_obj *package; + + while (packages) { + item = g_list_first(packages); + package = item->data; + g_list_free(package->children); + g_list_free(package->interrupts); + g_list_free(package->numa_nodes); + free(package); + packages = g_list_delete_link(packages, item); + } + package_count = 0; + + while (cache_domains) { + item = g_list_first(cache_domains); + cache_domain = item->data; + g_list_free(cache_domain->children); + g_list_free(cache_domain->interrupts); + g_list_free(cache_domain->numa_nodes); + free(cache_domain); + cache_domains = g_list_delete_link(cache_domains, item); + } + cache_domain_count = 0; + + + while (cpus) { + item = g_list_first(cpus); + cpu = item->data; + g_list_free(cpu->interrupts); + free(cpu); + cpus = g_list_delete_link(cpus, item); + } + core_count = 0; + +} + +static gint compare_cpus(gconstpointer a, gconstpointer b) +{ + const struct topo_obj *ai = a; + const struct topo_obj *bi = b; + + return ai->number - bi->number; +} + +struct topo_obj *find_cpu_core(int cpunr) +{ + GList *entry; + struct topo_obj find; + + find.number = cpunr; + entry = g_list_find_custom(cpus, &find, compare_cpus); + + return entry ? entry->data : NULL; +} + +int get_cpu_count(void) +{ + return g_list_length(cpus); +} + diff --git a/irqbalance.1 b/irqbalance.1 new file mode 100644 index 0000000..68e3cf8 --- /dev/null +++ b/irqbalance.1 @@ -0,0 +1,167 @@ +.de Sh \" Subsection +.br +.if t .Sp +.ne 5 +.PP +\fB\\$1\fR +.PP +.. +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. 
+.de Ip \" List item +.br +.ie \\n(.$>=3 .ne \\$3 +.el .ne 3 +.IP "\\$1" \\$2 +.. +.TH "IRQBALANCE" 1 "Dec 2006" "Linux" "irqbalance" +.SH NAME +irqbalance \- distribute hardware interrupts across processors on a multiprocessor system +.SH "SYNOPSIS" + +.nf +\fBirqbalance\fR +.fi + +.SH "DESCRIPTION" + +.PP +The purpose of \fBirqbalance\fR is to distribute hardware interrupts across +processors on a multiprocessor system in order to increase performance\&. + +.SH "OPTIONS" + +.TP +.B -o, --oneshot +Causes irqbalance to be run once, after which the daemon exits. +.TP + +.B -d, --debug +Causes irqbalance to print extra debug information. Implies --foreground. + +.TP +.B -f, --foreground +Causes irqbalance to run in the foreground (without --debug). + +.TP +.B -j, --journal +Enables log output optimized for systemd-journal. + +.TP +.B -p, --powerthresh= +Set the threshold at which we attempt to move a CPU into powersave mode. +If more than the threshold number of CPUs are more than 1 standard deviation below the +average CPU softirq workload, and no CPUs are more than 1 standard deviation +above (and have more than 1 IRQ assigned to them), attempt to place 1 CPU in +powersave mode. In powersave mode, a CPU will not have any IRQs balanced to it, +in an effort to prevent that CPU from waking up without need. + +.TP +.B -i, --banirq= +Add the specified IRQ to the set of banned IRQs. irqbalance will not affect +the affinity of any IRQs on the banned list, allowing them to be specified +manually. This option is additive and can be specified multiple times. For +example, to ban IRQs 43 and 44 from balancing, use the following command line: +.B irqbalance --banirq=43 --banirq=44 + +.TP +.B --deepestcache= +This allows a user to specify the cache level at which irqbalance partitions +cache domains.
Specifying a deeper cache may allow a greater degree of +flexibility for irqbalance to assign IRQ affinity to achieve greater performance +increases, but setting a cache depth too large on some systems (specifically +where all CPUs on a system share the deepest cache level) will cause irqbalance +to see balancing as unnecessary. +.B irqbalance --deepestcache=2 +.P +The default value for deepestcache is 2. + +.TP +.B -l, --policyscript=