Monday, June 18, 2007

Rebuilding a 3Ware RAID set in Linux

This information is specific to the 3Ware 9500 series controller (more specifically, the 9500-4LP). However, the 3Ware CLI appears to be the same on the other 3Ware 9XXX controllers I have worked with (the 9550, for sure).


Under Linux, the 3Ware cards can be managed through the "tw_cli" command. (The CLI tools can be downloaded for free from 3Ware's support website.)

A healthy RAID set looks like this:

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 OK - 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 OK u0 372.61 GB 781422768 3PM0Q3B7


A failed RAID set looks like this:

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7


Now I will remove this bad disk from the RAID set:


//dev306> maint remove c0 p3
Exporting port /c0/p3 ... Done.




I now need to physically replace the bad drive. Unfortunately, since our vendor wired some of our cables cockeyed, the port numbers do not reliably match the physical drive bays, so at this point I usually generate some I/O on the disks to see which of the four is actually the bad one. (Hint: the one with no activity light is the bad one.)


dev306:~# find /opt -type f -exec cat '{}' > /dev/null \;


With the bad disk identified and replaced, I need to go back into the 3Ware CLI, find the new disk, and tell the array to start rebuilding.


dev306:~# /opt/3Ware/bin/tw_cli
//dev306> maint rescan
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c0/p3].


//dev306> maint rebuild c0 u0 p3
Sending rebuild start request to /c0/u0 on 1 disk(s) [3] ... Done.

//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 REBUILDING 0 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7


Note that p3 still shows a status of "DEGRADED", but the array itself is now "REBUILDING". Under minimal I/O load, a RAID-5 built from 400GB disks such as this one takes about 2.5 hours to rebuild.
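
If you want to keep an eye on the rebuild without sitting at the CLI prompt, a simple loop like the following works (a minimal sketch; it only re-runs the same "info c0" command shown above, so adjust the path and interval to taste):

# Re-run the controller summary every five minutes while the rebuild runs.
while true; do
    date
    /opt/3Ware/bin/tw_cli info c0 | grep u0
    sleep 300
done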

Supermicro H8DAR-T BIOS Settings

We run a lot of Supermicro H8DAR-T motherboards in production. These are the BIOS settings that work well for us. I have not done much tweaking of BIOS settings to squeeze more performance out of our systems, since stability is key.

Note that unless specified here, we leave the settings at their default values. (Some of these settings are defaults but are documented because we need them set that way.) Especially important options are in bold.


Advanced->ACPI Settings->Advanced ACPI Settings
ACPI 2.0 [No]
ACPI APIC Support [Enabled]
ACPI SRAT Table [Enabled]
BIOS->AML ACPI Table [Enabled]
Headless Mode [Enabled]
OS Console Redirection [Always]

Advanced->AMD PowerNow Configuration
PowerNow [Disabled]

Advanced->Remote Access
Remote Access [Enabled]
Serial Port [COM2]
Serial Port Mode [19200,8,N,1]
Flow Control [None]
Redirection After Post [Always]
Terminal Type [vt100]
VT-UTF8 Combo Keys [Enabled]
SRedir Memory Display [No Delay]

Advanced->System Health->System Fan
Fan Speed Control [1) Disable - Full Speed]

PCIPnP
Plug and Play OS [No]
PCI Latency [64]
Allocate IRQ to PCI VGA [Yes]
Palette Snooping [Disabled]
PCI IDE BusMaster [Disabled]

Boot->Boot Device Priority
1) Floppy
2) PC-CD-244E (cdrom)
3) MBA Slot 218 (first ethernet)
4) 3Ware (or Onboard SATA)
5) MBA Slot 219 (second ethernet)

Chipset->NorthBridge->ECC Configuration
DRAM ECC [Enabled]
MCA ECC Logging [Enabled]
ECC Chipkill [Enabled]
DRAM Scrub Redirect [Enabled]
DRAM BG Scrub [163.8us]
L2 Cache BG Scrub [ 10.2us]
Data Cache BG Scrub [ 5.12us]

Chipset->NorthBridge->IOMMU Options
IOMMU Mode [Best Fit]
Aperture Size [64MB]

Supermicro H8DAR-T version detection

The Supermicro H8DAR-T motherboard comes in (at least) two flavors. The differences that I know about between the two versions are:

* The version 2.01 board will run Opensolaris/Nexenta out of the box. This is because of a difference in the SATA controller hardware. The version 1.01 board will not run Opensolaris without an add-on controller card.

* The 1.01 and 2.01 boards use different hardware sensors (for temperature, fan speed, etc.). We get sensor stats through our IPMI cards, so the IPMI cards need to be flashed with firmware matching the specific board revision. The IPMI cards still work for power on/off and console redirection without the matching firmware; only the sensors stop working if the IPMI firmware does not match the motherboard version.

Unfortunately, I do not see enough of a difference at POST time to be able to tell them apart. However, there are two ways I know of to do the detection.

1. With the cover of the machine off, the version can be seen in the back left corner of the board. (Will post pics later)

2. Under Linux, use the "dmidecode" command. The system board uses "Handle 0x0002". What works well for me is "dmidecode |grep -A3 'Base Board'". v1.01 boards report their Version as "1234567890" (way to go, Supermicro!), while v2.01 boards report their Version as "2.0". Examples:

v1board:~# dmidecode |grep -A3 "Base Board"
Base Board Information
Manufacturer: Supermicro
Product Name: H8DAR-T
Version: 1234567890


v2board:~# dmidecode |grep -A3 "Base Board"
Base Board Information
Manufacturer: Supermicro
Product Name: H8DAR-T
Version: 2.0
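
A quick way to script the check, using only the output above (a sketch; it assumes those two Version strings are the only values you will see):

BOARD_VER=$(dmidecode | grep -A3 "Base Board" | awk '/Version/ {print $2}')
case "$BOARD_VER" in
    2.0)        echo "H8DAR-T v2.01 board" ;;
    1234567890) echo "H8DAR-T v1.01 board" ;;
    *)          echo "Unknown board version: $BOARD_VER" ;;
esac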

Saturday, February 17, 2007

Path MTU discovery and MTU troubleshooting

Recently, while debugging some performance issues on a client's site, I came across some very interesting behavior. Some users were reporting that the site performed very well for a short period of time, but after a while performance became poor enough to render the site unusable. Checking the Apache logfiles for the IP addresses of those clients showed that the requests themselves were not taking an unusual amount of time, but rather that the requests were coming into the webserver at a snail's pace.

Checking at the network level, I saw some strange things happening:

prod-lb01:~# tethereal -R "http.request and ip.addr == (client)"
125.362898 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
125.362922 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
126.612994 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
126.613018 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
129.615113 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
129.615135 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
135.616047 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
135.616066 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)

Fragmentation Needed (ICMP Type 3/Code 4)? Why would we need to fragment incoming packets? This should only happen if a packet is bigger than the Maximum Transmission Unit (MTU), and since everything here is connected over Ethernet with a standard 1500-byte MTU, it is odd to see this.

Then I remembered that this site uses Linux Virtual Server (LVS) to load balance incoming requests. LVS can be configured in several ways, but this site uses IP-IP (aka LVS-Tun) load balancing, which encapsulates the incoming IP packet inside another packet and sends that to the destination server. Because of this encapsulation, each request that hits the load balancer gets an additional IP header tacked on to address the packet to the appropriate realserver, which adds 20 bytes to every packet.
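
The arithmetic, spelled out (a quick sanity check rather than anything authoritative):

# Standard Ethernet MTU                          1500 bytes
# IPIP (LVS-Tun) encapsulation header           -  20 bytes
# Effective MTU behind the load balancer         1480 bytes
# IP header (20) + TCP header (20)              -  40 bytes
# Largest TCP segment (MSS) that avoids frag     1440 bytes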

Okay, so the actual MTU of requests that go to the load balancer is 1480 due to the encapsulation overhead. Snooping for this type of packet at the router, I notice that we're sending out a LOT of them:

(router):~# tcpdump -n -i eth7 "icmp[icmptype] & icmp-unreach != 0 and icmp[icmpcode] & 4 != 0"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
17:07:00.608444 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.288197 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.910215 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.927728 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.391218 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.693094 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.912513 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:03.019852 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:03.398335 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)

These ICMP messages are not bad, per se; they are part of the Path MTU Discovery process. However, many firewalls indiscriminately block ICMP packets of all kinds. Most of the documentation I found while researching this problem was written from the end user's perspective, i.e., users who had PPPoE or other types of encapsulated/tunneled connections and had trouble reaching certain websites. With the proliferation of personal firewall hardware and software, some of which is overzealously configured to block all ICMP (even "good" ICMP like PMTU discovery), this is something that server admins have to worry about too, especially when running a load balancing solution which encapsulates packets.

The research I did on the problem pointed me to the following iptables rule to be added on the router:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
This is intended to force the advertised Maximum Segment Size (MSS) to be 40 bytes less than the smallest MTU the router knows about. However, this didn't work for us (this tcpdump line looks for any TCP handshakes plus any ICMP unreachable errors):

(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \
(tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"
tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
18:00:17.479661 IP (tos 0x0, ttl 53, id 47601, offset 0, flags [DF], length: 52)
(client).1199 > (server).80: S [tcp sum ok] 2541494183:2541494183(0) win 65535
<mss 1460,nop,wscale 2,nop,nop,sackOK>

18:00:17.479861 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)
(server).80 > (client).1199: S [tcp sum ok] 2875112671:2875112671(0) ack 2541494184 win 5840
<mss 1460,nop,nop,sackOK,nop,wscale 7>

18:00:17.771080 IP (tos 0xc0, ttl 63, id 10080, offset 0, flags [none], length: 576)
(server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
for IP (tos 0x0, ttl 52, id 47613, offset 0, flags [DF], length: 1500)
(client).1199 > (server).80: . 546:2006(1460) ack 1 win 64240

It was still negotiating a 1460-byte MSS during the handshake. In hindsight, this makes sense, because the router doesn't really know that the MTU of the load balancer and the realservers is actually smaller than 1500; the router communicates with those machines over their Ethernet interfaces, which are all still set to a 1500-byte MTU. Digging some more into the problem (including the LVS-Tun HOWTO linked above), I found quite a few things mentioned, but no real definitive answers.

I chose to fix this problem by hardcoding the MSS to 1440 at the router, rather than using the "clamp-mss-to-pmtu" setting:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1440:1536 -j TCPMSS --set-mss 1440
1440 is the normal MSS value of 1460 minus the 20-byte overhead of the encapsulated packet. This seems to have fixed the problem entirely:
(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \
(tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"
tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
18:02:19.466678 IP (tos 0x0, ttl 53, id 55012, offset 0, flags [DF], length: 52)
(client).1298 > (server).80: S [tcp sum ok] 2863214365:2863214365(0) win 65535
<mss 1460,nop,wscale 2,nop,nop,sackOK>

18:02:19.466886 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)
(server).80 > (client).1298: S [tcp sum ok] 2996826059:2996826059(0) ack 2863214366 win 5840
<mss 1440,nop,nop,sackOK,nop,wscale 7>

.... silence!
PS - The reason that I was seeing this very odd behavior - very fast at first, followed by an unusable site?
  • The client website had recently added a search history feature, stored in a browser cookie. Things would go great until enough data accumulated in the cookie to push the request up over 1440 bytes.
  • I had configured my home DSL router to discard ICMP many years back and had forgotten about it - my firewall was throwing away the ICMP Fragmentation Needed packets, so my PC never "got the memo" that it needed to send smaller packets!
This actually worked out for the better, though - this site had had reports of odd slowness in the recent past, and hopefully this was the root cause!

EDIT: Note that in the original post I had missed an important option: in the iptables config it is important to use the "-m tcpmss --mss 1440:1536" setting. Without this match, iptables will force the MSS of ALL traffic to 1440, including clients which request a size smaller than that. This obviously presents a problem for those clients.

Thursday, February 08, 2007

Search Engine Optimization with Apache and mod_rewrite

I've recently been using mod_rewrite to modify the URLs on a client's website. mod_rewrite is a powerful tool that lets you turn "ugly" URLs like

http://example.com/search.cgi?searchType=pie&searchTerm=pumpkin%20pie

into cleaner URLs like

http://www.example.com/pie/pumpkin_pie

This is useful for a couple of reasons: not only is it cleaner to look at, but it can help with search engine indexing. In this case, because "pumpkin_pie" is part of the URL path rather than part of the query string, the keyword ranks higher in many search engines.

Let's say we have an application that returns search results for various categories, and we want the URLs to have the format "http://www.example.com/(category)/(search term)". We also want a landing page if the URL is simply "http://www.example.com/(category)". We want to make this as generic as possible so that httpd.conf does not need to be edited every time a category is added.

This can be configured a number of ways, but the way I have it installed here is with Apache running on port 80 and the application, a Java servlet container, running on a different port, say port 8000. Apache serves most of the requests for static, on-disk content itself and uses the proxy mechanism to send dynamic requests to the servlet container. Let's break down the relevant sections of the Apache configuration file.

First, it can be useful to funnel all traffic for your site through a single hostname, rather than serving links under both "example.com" and "www.example.com". This rule forces requests back to "www.example.com" with an HTTP 301 redirect:

RewriteCond %{HTTP_HOST} ^example.com$ [NC]
RewriteRule ^/(.*) http://www.example.com/$1 [L,R=301]

Now let's map the static page elements and HTML to the local filesystem, so that they don't get remapped to a search query and are served by Apache instead of being proxied through another layer. Note that we need to map favicon.ico to the local filesystem, or else you can end up sending searches to your application when the browser requests the favicon.ico for /pie/pumpkin_pie/favicon.ico! The [L] rewrite modifier tells the rewrite engine to stop processing at this point and serve the file directly.

RewriteRule ^/js/(.*) /opt/static/js/$1 [L]
RewriteRule ^/pictures/(.*) /opt/static/pictures/$1 [L]
RewriteRule ^/images/(.*) /opt/static/images/$1 [L]
RewriteRule ^/css/(.*) /opt/static/css/$1 [L]
RewriteRule /favicon.ico$ /opt/static/html/favicon.ico [L]
RewriteRule ^/robots.txt /opt/static/html/robots.txt [L]

Another useful trick is to re-map underscores to %20 in the search parameters, so we can use terms like "pumpkin_pie" that get remapped to "pumpkin%20pie" when sent to the backend application. This rule matches any URL that has an underscore in it, rewrites one underscore to %20, and then sends processing back to the first rewrite rule (the [N] flag), so it keeps remapping them one at a time until they're all gone. This is necessary because we don't know how many underscores there might be in the URL, and there is no "replace all" modifier like the /g in normal Unix search and replace. Note the "QSA" in the rule modifiers; this means "Query String Append" and leaves any query string intact through the processing:

RewriteCond %{REQUEST_URI} ^/.*_
RewriteRule ^/(.*)_(.*) /$1\%20$2 [N,QSA]

Now let's say there are a couple of URL paths we want to treat differently; say, we need to handle the "buy" section of the site specially. Because of the way we map the general search cases later in this file, anything that needs special treatment has to be mapped in a way that bypasses the generic match:

RewriteRule ^/buy/(.*) /purchase.jsp?cat=$1 [QSA]

Now for the "/(category)" landing page. We have to have a limitation here for categories to be only alphanumeric characters - this is so that things like "purchase.jsp" are not treated as categories! Also we prevent any request that contains a query string from being treated as a category, so we can have servlets, etc, continue to work:

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^/([a-z]*)$ /landingPage.jsp?category=$1 [NC]

Now for the generic /(category)/(searchterm) mapping.

RewriteRule ^/([a-z]*)/(.*) /search.jsp?category=$1&search=$2 [NC,QSA]

At the end of the line, we proxy the resulting rewritten URL to our application:

RewriteRule ^/(.*) http://127.0.0.1:8000/$1 [P]

And if you run into any trouble, you can turn on logging with the following directives:

RewriteLog /opt/app/logs/rewrite.log
RewriteLogLevel 9

Now of course, these rules only map INCOMING URLs to our application. Our application is still responsible for emitting this URL format in the pages it sends back to users, so that anyone linking to your site uses the optimized format. Another way to get these URLs in front of search engines is with a sitemaps file; see www.sitemaps.org for details.


Wednesday, January 24, 2007

Linux memory overcommit

Last week I learned something very interesting about the way Linux allocates and manages memory by default.

In a way, Linux allocates memory the way an airline sells plane tickets. An airline will sell more tickets than they have actual seats, in the hopes that some of the passengers don't show up. Memory in Linux is managed in a similar way, but actually to a much more serious degree.

Under the default memory management strategy, malloc() essentially always succeeds, with the kernel assuming you're not _really_ going to use all of the memory you just asked for. The malloc() calls keep succeeding; the kernel doesn't 'really' allocate the memory until you actually touch it. This leads to severe pathology in low-memory conditions: the application has already allocated the memory and thinks it can use it free and clear, but when it actually tries to use it while the system is short on memory, the access takes a very long time as the kernel hunts around for pages to hand out.

In an extremely low memory condition, the kernel starts firing off the "OOM Killer" routine. Processes are given 'OOM scores', and the process with the highest score win^H^H^Hloses. This leads to seemingly random processes on a machine being killed by the kernel. Keeping with the airline analogy, I found this entertaining post.
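
If you are curious which processes the kernel considers the best candidates, the scores are visible in /proc on 2.6 kernels. A rough way to list the top scorers (a sketch; the formatting is approximate and kernel threads will show an empty command line):

# List each process's OOM score; the highest scores are killed first.
for pid in /proc/[0-9]*; do
    printf "%-8s %-8s %s\n" "${pid##*/}" "$(cat "$pid/oom_score" 2>/dev/null)" \
        "$(tr '\0' ' ' < "$pid/cmdline" 2>/dev/null)"
done | sort -k2 -rn | head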

I found some interesting information about the Linux memory manager here in section 9.6. That section has three small C programs that test memory allocation. The second and third programs produced pretty similar results for me, so I'm omitting the third:

Here are the results of the test on an 8GB Debian Linux box:

demo1: malloc memory and do not use it: Allocated 1.4TB, killed by OOM killer
demo2: malloc memory and use it right away: Allocated 7.8GB, killed by OOM killer


Here are the results on an 8GB Nexenta/Opensolaris machine:

demo1: malloc memory and do not use it: Allocated 6.6GB, malloc() fails
demo2: malloc memory and use it right away: Allocated 6.5GB, malloc() fails


Apparently, a big reason Linux manages memory this way out of the box is to optimize memory usage for fork()'ed processes; fork() creates a full copy of the process address space, but with overcommitted memory, only pages which have been written to actually need to be allocated by the kernel. This might work very well for a shell server, a desktop, or perhaps a server with a large memory footprint that forks separate processes rather than threads, but in our situation it is very undesirable.

We run a pretty Java-heavy environment, with multiple large JVMs configured per host. The problem is that the heap sizes have been getting larger, and we were running in an overcommitted situation without realizing it. The JVMs would all start up and malloc() their large heaps, and then at some later time, once enough of the heaps had actually been used, the OOM killer would kick in and more or less randomly kill off one of our JVMs.

I found that Linux can be brought more in line with traditional/expected memory management by setting these sysctls (apparently they are only available on 2.6 kernels):

vm.overcommit_memory (0=default, 1=malloc always succeeds(?!?), 2=strict overcommit)
vm.overcommit_ratio (50=default, I used 100)
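
For example, switching a box over looks something like this (the same values discussed above; add the two settings to /etc/sysctl.conf to make them persistent across reboots):

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100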


The ratio appears to be the percentage of the system's total VM that can be allocated via malloc() before malloc() fails. This MIGHT be on a per-pid basis (I need to research that). The number can be greater than 100%, presumably to allow for some slop from copy-on-write fork()'s. When I set it to 100 on an 8GB system, I was able to malloc() about 7.5GB of stuff, which seemed about right since I had normal multi-user processes running and no swap configured. I don't know why you'd want to use a number much less than 100, unless it were a per-process limit or you wanted to reserve some room for the filesystem cache.

The big benefit here is that malloc() can actually fail in a low memory condition. This means the error can be caught and handled by the application. In my case, it means that JVMs fail at STARTUP time, with an obvious memory-shortage error in the logs, rather than having the rug yanked out from under them hours or days later with no message in the application log and no opportunity to clean up.

Here are the demo programs on a Linux machine set to strict overcommit with a 100% ratio:

demo1: malloc memory and do not use it: Allocated 7.3GB, malloc fails.
demo2: malloc memory and use it right away: Allocated 7.3GB, malloc fails.


Technorati Tags: OOM

Tuesday, January 23, 2007

Debugging mysql5 on Nexenta

Due to some very favorable benchmarking results, I am planning to migrate some of our production databases to MySQL 5 on Nexenta. Previously the database was running on a Debian Linux server, where the I/O subsystem performed much worse.

I ran into a very strange problem with MySQL 5 under Nexenta, however. After a certain number of clients had connected, the server would sometimes begin refusing connections in an odd way: it would accept the connection on the MySQL port and then immediately close it.

Fortunately, one of the other reasons I want to move to Nexenta is its more robust toolchain for troubleshooting just these kinds of problems. I started out by running 'truss' on thread 1 of the mysql daemon, under the assumption that it was the thread responsible for managing incoming client connections - not a bad guess. Here is a trace of a mysql connection that works correctly versus one that breaks:

Works OK:

root@perftest-db01:~# truss -w all -p 6650/1
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) = 1
/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: accept(11, 0x08047948, 0x08047958, SOV_DEFAULT) = 57
/1: fcntl(11, F_SETFL, FWRITE) = 0
/1: sigaction(SIGCLD, 0x08047420, 0x080474A0) = 0
/1: getpid() = 6650 [6589]
/1: getpeername(57, 0xFEF67A90, 0x080474B8, SOV_DEFAULT) = 0
/1: getsockname(57, 0xFEF67A80, 0x080474B8, SOV_DEFAULT) = 0
/1: open("/etc/hosts.allow", O_RDONLY) = 58
/1: fstat64(58, 0x08046B20) = 0
/1: fstat64(58, 0x08046A50) = 0
/1: ioctl(58, TCGETA, 0x08046AEC) Err#25 ENOTTY
/1: read(58, " # / e t c / h o s t s".., 8192) = 677
/1: read(58, 0x504EA88C, 8192) = 0
/1: llseek(58, 0, SEEK_CUR) = 677
/1: close(58) = 0
/1: open("/etc/hosts.deny", O_RDONLY) = 58
/1: fstat64(58, 0x08046B20) = 0
/1: fstat64(58, 0x08046A50) = 0
/1: ioctl(58, TCGETA, 0x08046AEC) Err#25 ENOTTY
/1: read(58, " # / e t c / h o s t s".., 8192) = 901
/1: read(58, 0x504EA88C, 8192) = 0
/1: llseek(58, 0, SEEK_CUR) = 901
/1: close(58) = 0
/1: getsockname(57, 0x08047938, 0x08047958, SOV_DEFAULT) = 0
/1: fcntl(57, F_SETFL, (no flags)) = 0
/1: fcntl(57, F_GETFL) = 2
/1: fcntl(57, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: setsockopt(57, ip, 3, 0x0804748C, 4, SOV_DEFAULT) = 0
/1: setsockopt(57, tcp, TCP_NODELAY, 0x0804748C, 4, SOV_DEFAULT) = 0
/1: time() = 1169599669
/1: lwp_kill(73, SIG#0) Err#3 ESRCH
/1: lwp_create(0x08047240, LWP_DETACHED|LWP_SUSPENDED, 0x08047464) = 243
/1: lwp_continue(243) = 0
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)

Immediately closes connection:

root@perftest-db01:~# truss  -w all  -p 6650/1
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) = 1
/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: accept(11, 0x08047948, 0x08047958, SOV_DEFAULT) = 255
/1: fcntl(11, F_SETFL, FWRITE) = 0
/1: sigaction(SIGCLD, 0x08047420, 0x080474A0) = 0
/1: getpid() = 6650 [6589]
/1: getpeername(255, 0xFEF67A90, 0x080474B8, SOV_DEFAULT) = 0
/1: getsockname(255, 0xFEF67A80, 0x080474B8, SOV_DEFAULT) = 0
/1: open("/etc/hosts.allow", O_RDONLY) = 257
/1: close(257) = 0
/1: fxstat(2, 256, 0x08045DF8) = 0
/1: time() = 1169599778
/1: getpid() = 6650 [6589]
/1: putmsg(256, 0x080467B8, 0x080467C4, 0) = 0
/1: open("/var/run/syslog_door", O_RDONLY) = 257
/1: door_info(257, 0x08045C10) = 0
/1: getpid() = 6650 [6589]
/1: door_call(257, 0x08045C48) = 0
/1: close(257) = 0
/1: fxstat(2, 256, 0x080459B8) = 0
/1: time() = 1169599778
/1: getpid() = 6650 [6589]
/1: putmsg(256, 0x08046378, 0x08046384, 0) = 0
/1: open("/var/run/syslog_door", O_RDONLY) = 257
/1: door_info(257, 0x080457D0) = 0
/1: getpid() = 6650 [6589]
/1: door_call(257, 0x08045808) = 0
/1: close(257) = 0
/1: fxstat(2, 256, 0x08046A88) = 0
/1: time() = 1169599778
/1: getpid() = 6650 [6589]
/1: putmsg(256, 0x08047448, 0x08047454, 0) = 0
/1: open("/var/run/syslog_door", O_RDONLY) = 257
/1: door_info(257, 0x080468A0) = 0
/1: getpid() = 6650 [6589]
/1: door_call(257, 0x080468D8) = 0
/1: close(257) = 0
/1: shutdown(255, SHUT_RDWR, SOV_DEFAULT) = 0
/1: close(255) = 0
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)

Looks like the main difference starts here:
/1:     open("/etc/hosts.allow", O_RDONLY)              = 257
/1: close(257) = 0
That explains a lot -- the "hosts.allow" file is part of the tcpwrappers system, which controls access to various daemons based on access control rules set by the system administrator. No wonder I am getting a connection and then immediately getting booted. The daemon is opening the hosts.allow file but then immediately closing it, instead of actually reading and processing the file as seen in the working connection. Does the process not have enough file descriptors?
root@perftest-db01:~# pfiles  6650 |head -2
6650: /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql
Current rlimit: 8192 file descriptors
Nope, doesn't look that way -- it's configured to allow 8192 file descriptors. My next clue was the file descriptor number returned by the "open" system call: 257. That's awfully near one of those magic power-of-2 boundaries, so I started snooping around on Google.

It turns out that under Solaris (and maybe *BSD as well), the tcpwrappers library (libwrap) uses the "stdio" library to manage I/O. The 32-bit stdio implementation cannot handle file descriptors above 255, so as the mysql server accumulates client connections and open tables, that boundary is eventually crossed and the calls to open "hosts.allow" effectively fail because they return too high a file descriptor number. tcpwrappers appears to fail closed: since it cannot read the "hosts.allow" file, it denies access to the service by immediately closing the connection.
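
A quick way to see how close a process is to that boundary (6650 is the mysqld pid from the truss output above; Solaris exposes open descriptors under /proc/<pid>/fd):

# Count how many file descriptors mysqld currently has open.
ls /proc/6650/fd | wc -l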

Fortunately, there is a fix. Giri Mandalika has a blog entry that references the issue and is a good resource on the problem. The solution is to use the extendedFILE library provided in Solaris Express 06/06 or later (so it is included in Nexenta Alpha 6, and possibly earlier releases):

root@perftest-db01:~# export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
root@perftest-db01:~# /etc/init.d/mysql restart
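
Since LD_PRELOAD_32 has to be in mysqld's environment on every start, the same setting belongs in the startup script. A minimal sketch of that change (the exact layout of the init script is an assumption):

# Near the top of /etc/init.d/mysql, before mysqld is launched:
LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
export LD_PRELOAD_32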

(As sketched above, the /etc/init.d/mysql startup script will also need to be modified to include the LD_PRELOAD_32 setting.) Now, I start up a test program to artificially create a bunch of connections to the database, and see what a truss looks like now:

root@perftest-db01:~# truss -w all -p 6846/1
/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) (sleeping...)
/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) = 1
/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: accept(11, 0x08047918, 0x08047928, SOV_DEFAULT) = 294
/1: fcntl(11, F_SETFL, FWRITE) = 0
/1: sigaction(SIGCLD, 0x080473F0, 0x08047470) = 0
/1: getpid() = 6846 [6785]
/1: getpeername(294, 0xFEF47A90, 0x08047488, SOV_DEFAULT) = 0
/1: getsockname(294, 0xFEF47A80, 0x08047488, SOV_DEFAULT) = 0
/1: open("/etc/hosts.allow", O_RDONLY) = 295
/1: fstat64(295, 0x08046AF0) = 0
/1: fstat64(295, 0x08046A20) = 0
/1: ioctl(295, TCGETA, 0x08046ABC) Err#25 ENOTTY
/1: read(295, " # / e t c / h o s t s".., 8192) = 677
/1: read(295, 0x5122F9D4, 8192) = 0
/1: llseek(295, 0, SEEK_CUR) = 677
/1: close(295) = 0
/1: open("/etc/hosts.deny", O_RDONLY) = 295
/1: fstat64(295, 0x08046AF0) = 0
/1: fstat64(295, 0x08046A20) = 0
/1: ioctl(295, TCGETA, 0x08046ABC) Err#25 ENOTTY
/1: read(295, " # / e t c / h o s t s".., 8192) = 901
/1: read(295, 0x5122F9D4, 8192) = 0
/1: llseek(295, 0, SEEK_CUR) = 901
/1: close(295) = 0
/1: getsockname(294, 0x08047908, 0x08047928, SOV_DEFAULT) = 0
/1: fcntl(294, F_SETFL, (no flags)) = 0
/1: fcntl(294, F_GETFL) = 2
/1: fcntl(294, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: setsockopt(294, ip, 3, 0x0804745C, 4, SOV_DEFAULT) = 0
/1: setsockopt(294, tcp, TCP_NODELAY, 0x0804745C, 4, SOV_DEFAULT) = 0
/1: time() = 1169601081
/1: lwp_kill(273, SIG#0) Err#3 ESRCH
/1: lwp_create(0x08047210, LWP_DETACHED|LWP_SUSPENDED, 0x08047434) = 274
/1: lwp_continue(274) = 0
/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) (sleeping...)

As you can see above, the "open" call on the "hosts.allow" file now returns a file descriptor greater than 255, but reading and processing of hosts.allow proceeds normally and the connection is accepted.

Yay for truss!


Monday, January 22, 2007

ZFS features

Here's a post I just entered on the Nexenta/gnusolaris Beginners Forum that has some good info about ZFS. Apparently the formatting got eaten on the mailing list so I'm reposting it here:


Hi all,

Can I have it installed concurrently with Linux and allocate Linux partitions to the RAID-Z, or does RAID-Z take whole disks?


There are two "layers" of partitions in opensolaris; the first is managed with the "fdisk" utility, the second is managed with the "format" utility - these partitions are aka "slices". I am not an expert, but I believe that the "fdisk" managed partitions are the pieces that linux/windows/etc sees. You first would allocate one of these partitions to Solaris, and from there you can additionally split that fdisk partition into root/swap/data "slices". I believe that the linux partitions you'd see would be visible via the "fdisk" command.

According to some of the ZFS FAQ/wiki resources, ZFS is "better" if it manages the entire disk; however, it will work just fine managing either "partitions" or "slices". You can even make a ZFS pool out of individual files.
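
For example, a throwaway pool built from plain files might look something like this (a sketch for experimentation only; the file and pool names are made up):

mkfile 256m /var/tmp/zfile1 /var/tmp/zfile2
zpool create testpool mirror /var/tmp/zfile1 /var/tmp/zfile2
zpool status testpool
zpool destroy testpool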

Here is an example of one of my disks. There is one "fdisk" partition, and a few "slices":


root@medb01:~# fdisk -g /dev/rdsk/c0t0d0p0
* Label geometry for device /dev/rdsk/c0t0d0p0
* PCYL NCYL ACYL BCYL NHEAD NSECT SECSIZ
48638 48638 2 0 255 63 512

root@medb01:~# prtvtoc /dev/rdsk/c0t0d0p0
* /dev/rdsk/c0t0d0p0 partition map
*
* Dimensions:
* 512 bytes/sector
* 63 sectors/track
* 255 tracks/cylinder
* 16065 sectors/cylinder
* 48640 cylinders
* 48638 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
* First Sector Last
* Partition Tag Flags Sector Count Sector Mount Directory
0 0 00 16065 8401995 8418059
1 0 00 8418060 16787925 25205984
2 5 01 0 781369470 781369469
6 0 00 25205985 756147420 781353404
7 0 00 781353405 16065 781369469
8 1 01 0 16065 16064


Note that in the following examples I'll create ZFS pools on "c0tXd0s6", that is, slice 6 of the Solaris partition table shown above.


Alternatively, can I mount my Linux RAID partitions on Nexenta, at least for migration purposes? What about the LVM disks?


As far as I know, there is no LVM or Linux filesystem support built into OpenSolaris/Nexenta; i.e., you could not just "mount -t ext3" a Linux filesystem and be able to read it. Since you've mentioned that you're running a VMware server, I suppose it may be possible to have both guest operating systems running and copy the data over the 'network'. Also, it's likely that Nexenta won't know about LVM-managed partitions; it would have to be a real honest-to-goodness partition.


What about RAID-Z features:
Can I hot-swap a defective disk?


This should be possible, assuming that your hardware supports it. You may need to force a rescan of the devices after you replace a disk; check devfsadm. Reintegrating the new disk into the pool would be accomplished with "zpool replace pool device [new_device]".
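
Hypothetically, if a disk named c0t2d0s6 in a pool named u01 went bad and was physically swapped, the reintegration might look like this (device and pool names are made up; "zpool status" shows resilver progress):

devfsadm
zpool replace u01 c0t2d0s6
zpool status u01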


Can I add a disk to the server and tell it to enlarge the pool, to make more space available on the preexisting RAID?


Yes, with a caveat: ZFS doesn't do any magic stripe re-balancing. If you have a 4-disk raidz pool and add another disk, what you really have is a 4-disk raidz with a single non-redundant disk tacked on at the end. Best practice is to add space in 'chunks' of several disks. Fortunately, I am in the middle of building a Nexenta-based box with 4 SATA drives, so I can play around with some of the commands and show you the output:

Here is a 4-disk zpool using raidZ:


root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 c0t3d0s6
root@medb01:~# zpool status u01
pool: u01
state: ONLINE
scrub: none requested
config:

NAME          STATE   READ WRITE CKSUM
u01           ONLINE     0     0     0
  raidz1      ONLINE     0     0     0
    c0t0d0s6  ONLINE     0     0     0
    c0t1d0s6  ONLINE     0     0     0
    c0t2d0s6  ONLINE     0     0     0
    c0t3d0s6  ONLINE     0     0     0



Here is a 3-disk raidZ pool that I "grow" by adding a single additional disk. Note the subtle indentation difference on c0t3d0s6 in this example; it is not part of the original raidz1 and is just a standalone disk in the pool.


root@medb01:~# zpool destroy u01
root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6
root@medb01:~# zpool add u01 c0t3d0s6
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses raidz and new vdev is disk
root@medb01:~# zpool add -f u01 c0t3d0s6
root@medb01:~# zpool status u01
pool: u01
state: ONLINE
scrub: none requested
config:

NAME          STATE   READ WRITE CKSUM
u01           ONLINE     0     0     0
  raidz1      ONLINE     0     0     0
    c0t0d0s6  ONLINE     0     0     0
    c0t1d0s6  ONLINE     0     0     0
    c0t2d0s6  ONLINE     0     0     0
  c0t3d0s6    ONLINE     0     0     0




Here is an example of adding space in "chunks"; note that the size of the pool differs in the "zpool list" output before and after.


root@medb01:~# zpool destroy u01
root@medb01:~# zpool create u01 mirror c0t0d0s6 c0t1d0s6
root@medb01:~# zpool list u01
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
u01 360G 53.5K 360G 0% ONLINE -
root@medb01:~# zpool add u01 mirror c0t2d0s6 c0t3d0s6
root@medb01:~# zpool list u01
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
u01 720G 190K 720G 0% ONLINE -
root@medb01:~# zpool status u01
pool: u01
state: ONLINE
scrub: none requested
config:

NAME          STATE   READ WRITE CKSUM
u01           ONLINE     0     0     0
  mirror      ONLINE     0     0     0
    c0t0d0s6  ONLINE     0     0     0
    c0t1d0s6  ONLINE     0     0     0
  mirror      ONLINE     0     0     0
    c0t2d0s6  ONLINE     0     0     0
    c0t3d0s6  ONLINE     0     0     0


PS, doing it this way appears to stripe writes across the two mirrored "subvolumes".


Does it have a facility similar to LVM, where I can create 'logical volumes' on top of the RAID and allocate/deallocate space as needed for flexible storage management (without putting the machine offline)?


Yes, there are two layers in ZFS: pool management, handled through the "zpool" command, and filesystem management, handled through the "zfs" command. Individual filesystems are mounted as subdirectories of the base pool by default, or can be relocated with the "zfs set mountpoint" option if you desire. Here I create a ZFS filesystem at /u01/opt with a 100MB quota, and then increase the quota to 250MB.


root@medb01:~# zfs create -oquota=100M u01/opt
root@medb01:~# df -k /u01 /u01/opt
Filesystem kbytes used avail capacity Mounted on
u01 743178240 26 743178105 1% /u01
u01/opt 102400 24 102375 1% /u01/opt
root@medb01:~# zfs set quota=250m u01/opt
root@medb01:~# df -k /u01 /u01/opt
Filesystem kbytes used avail capacity Mounted on
u01 743178240 26 743178105 1% /u01
u01/opt 256000 24 255975 1% /u01/opt


Also, things like atime updates, compression, etc., can be set on a per-filesystem basis.



Can I do fancy stuff like plug an e-sata disk to my machine and tell it to 'ghost' a 'logical volume' on-the-fly, online, without unmounting the volume?


Yes, this is possible. ZFS supports "snapshots", which are moment-in-time copies of an entire ZFS filesystem. ZFS can also "send" and "receive" a snapshot, so you can take that moment-in-time copy of your filesystem and replicate it somewhere else (or just leave the snapshot lying around for recovery purposes).

The procedure would be to create a ZFS pool on your external drive and then "zpool import" that pool each time you plugged the drive in. Then create a snapshot of your filesystem and "send" it to the external drive, like so. (I don't have an external drive to import, so I'll just create two pools.) I test by creating a filesystem, creating a file in that filesystem, then snapshotting and sending that snapshot to a different pool. Note that the file I created exists in the destination when I'm done.


root@medb01:/# zpool destroy u01
root@medb01:/# zpool destroy u02
root@medb01:/# zpool create u01 mirror c0t0d0s6 c0t1d0s6
root@medb01:/# zpool create u02 mirror c0t2d0s6 c0t3d0s6
root@medb01:/# zfs create u01/data
root@medb01:/# echo "test test test" > /u01/data/testfile.txt
root@medb01:/# zfs snapshot u01/data@send_test
root@medb01:/# zfs send u01/data@send_test | zfs receive u02/u01_copy
root@medb01:/# ls -l /u02/u01_copy
total 1
-rw-r--r-- 1 root root 15 Jan 23 04:49 testfile.txt
root@medb01:/# cat /u02/u01_copy/testfile.txt
test test test
root@medb01:/#


Hope all this helps (and maybe makes it into the wiki too :-) )