BlackBerry Forums Support Community
              

Closed Thread
 
10-02-2006, 02:26 PM   #1
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560
Delayed Email Delivery (5-15 mins) on Domino Cluster Failover

Hey all Domino BES folks,

We had our primary server for one of our larger cities go down for extended maintenance (18+ hours) on Saturday. No problem, the Domino cluster is offsite. Well, BES 4.0 SP5 HF2 started delivering mail from the cluster, but there was a delay of 5-15 mins for delivery to the device. Actually the delay was in checking email, not delivery.

Looks like each thread for that city's mail server only did work at 10 second intervals, then was quiet for another 10 seconds. All the other threads, for mail servers in other cities, were fine. Is there some timeout or primary cluster check that takes 10 seconds before accessing the cluster node and doing any work? I've talked to RIM but they haven't gotten back to me yet. Any other real world experiences with BES 4.0 failover to a Domino cluster node, and the effects (if any) this caused?

Thanks!

10-02-2006, 03:27 PM   #2
amukhey
CrackBerry Addict
Join Date: Sep 2004
Location: Los Angeles
Model: 9700
Carrier: T-Mobile
Posts: 750

I would:

1. Check the load on the cluster server. Can it take it, or is it running out of memory? Does it have enough crank?

2. Do you have a server dedicated to SMTP (a hub server), or is that task running on the cluster server as well?

3. How many mail.box files do you have on that cluster server?

4. Since your mail servers are not up, your cluster is taking a lot of the load, unless you have 2 failover servers in the cluster to share it (see #1).

*Note that I have done this in the past. The delay we saw was because our DR (disaster recovery center) cluster server was doing the SMTP routing to begin with, along with routing to the devices. That is a lot of load for one server acting as primary failover + BlackBerry + SMTP for 1500 users.*

10-03-2006, 03:14 PM   #3
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

Hi amukhey,

1. The failover cluster mail server is an iSeries AS/400 instance, no issues performance wise there. Also this was done on a weekend.

2. We do have dedicated SMTP servers, so no issue there either.

3. We have 2, mail1.box and mail2.box

4. This was only the NY server that was down for maintenance, so there were only 250 users on the clustered mail server.

The BES server hosts 1325 today, but only ~250 are NY users on the mail cluster that failed over.

This appears not to be a performance issue, as the server was loafing for >18 hours while messages were consistently delayed. RIM claims it was from a lot of OTAFM reconciliation activity. Well, the next morning the primary came back online, and there was a whole lot of OTAFM reconciliation activity as BES scanned the primary again. But guess what? No delays. Even while scanning all the mailboxes on the primary, messages were still getting delivered instantly to the devices, as soon as the server came up. This is why I think there is something blocking or delaying activity while the primary cluster node is down. I have said as much to RIM but have not yet received a response to this new info.

Once again, I would appreciate it if you or any other Domino shops would watch during a failover (planned or unplanned) and see if your deliveries begin having 5-15 minute delays. The delay seems to be in scanning the mailbox for new messages; actual delivery is fast once it grabs the mail. Just want to see if I am not crazy here. Maybe there is something screwy in our environment, but I'm not sure as of yet.

10-03-2006, 05:15 PM   #4
Puddin
Thumbs Must Hurt
Join Date: Sep 2004
Model: 8800
Carrier: AT&T
Posts: 60
Failover delay

Does the BB server have a connection document to both servers?

Ken

10-04-2006, 11:38 AM   #5
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

Nope, no connection docs as the BES server does not need to route mail or replicate to those servers. Only connection docs to the SMTP relays.

10-04-2006, 01:47 PM   #6
Sagz
Knows Where the Search Button Is
Join Date: Feb 2006
Model: 8100t
Carrier: tmobile
Posts: 45

Did you ever experience this before going to SP5, and did your threadpool reoptimize itself to handle the other server efficiently?

10-04-2006, 04:38 PM   #7
asameer
New Member
Join Date: Jan 2006
Model: 7520
Posts: 6

I have the same issue with an Exchange 2003 cluster. We failed over one of the nodes last weekend, and since then we have been facing problems with delayed emails. The delays vary from 15 min to 1 hr. If a delay reaches 1 hr we simply restart the services. This trick is working for us, but I am looking for a permanent solution. Can anyone help?
__________________
Thanks,
Sameer

10-05-2006, 04:14 PM   #8
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

We did experience the same thing under SP3 a few months back. I thought I saw something about a bug fixed in SP5 regarding "database backoff", so I hoped that took care of it. Apparently not...

Not sure how I would know if the threadpool "re-optimized" itself - can you throw me a bone on that one?

10-09-2006, 09:17 AM   #9
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

Followup: Turns out RIM believes it was the syncing of read/unread marks between the cluster nodes. Most of our databases have synced unread marks since 8/05, but a substantial number have pre-8/05 mail where the unread marks differ. RIM assures me that if the unread marks were perfectly in sync, these delays would not occur. Well, we will see, perhaps in 3 years when the old email is aged out...

10-26-2006, 12:31 PM   #10
mcreemer
Knows Where the Search Button Is
Join Date: Aug 2005
Model: 8320
Carrier: T-Mobile
Posts: 21

I think I have an additional piece of information for you. The same issue happened to me, and we did notice the BlackBerry servers doing full mail file scans (the OTAFM traffic mentioned above) because the BES sees the failover server as a new source. These mail file scans do cause some additional traffic, but the biggest problem is that the BES API/Domino program apparently doesn't have the smarts to realize that the primary mail server is down and keeps wanting to connect to it. The result is that whenever the BES wants to connect to someone's mail file, it attempts to connect to the primary server, waits for that to time out, and only then connects to the failover server. That's not an issue if you have a low number of users, but if you (like me) have 400/500 users on a BES, the server cannot keep up and the BES process becomes slower than a turtle on crutches. People will see slow lookups and delayed email (varying from minutes to hours). In Task Manager you will see that the BES server is really not doing much, and the network volume is minimal. I guess the only workarounds for this issue are to:

- Up your polling time from 20 seconds to something much higher.
- Change the person document for the affected users and list the failover server as their mail server. This way the BES will immediately attempt to connect to the failover server without a connection attempt to the dead server.
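
To put rough numbers on the effect described above, here is a back-of-the-envelope sketch in Python. It is only a model, not anything taken from the BES: the 10-second timeout is borrowed from the pauses mahoward saw in the first post, the 250 users from his NY server, and the thread count and per-mailbox check time are invented for illustration.

```python
import math

def cycle_seconds(users, threads, connect_timeout_s, check_s):
    """Time for one full pass over every mail file.

    Assumes each worker thread handles its users one at a time and
    waits connect_timeout_s on the dead primary before every check.
    """
    users_per_thread = math.ceil(users / threads)
    return users_per_thread * (connect_timeout_s + check_s)

# Hypothetical values: 5 worker threads, 0.5 s to check a mail file
# once connected. Only the 250 users and the 10 s come from the thread.
healthy = cycle_seconds(250, 5, 0.0, 0.5)    # primary up, no timeout
failover = cycle_seconds(250, 5, 10.0, 0.5)  # primary down, timeout on every check

print(f"primary up:   {healthy:.0f} s per pass")
print(f"primary down: {failover:.0f} s per pass")
```

Under these assumptions a pass takes 25 seconds with the primary up and 525 seconds (just under 9 minutes) with it down, which lands in the 5-15 minute range reported at the top of the thread while leaving the server nearly idle. Different thread counts or timeouts would shift the number, so treat it as an order-of-magnitude check only.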

10-26-2006, 12:33 PM   #11
mcreemer
Knows Where the Search Button Is
Join Date: Aug 2005
Model: 8320
Carrier: T-Mobile
Posts: 21

BTW, our read/unread marks were set to replicate across servers for all users, so whoever (at RIM) told you that there will be no more delays is wrong.

10-26-2006, 01:23 PM   #12
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

Thanks mcreemer, this is what I suspected all along. I gave them all the info and my suspicions that you describe above and they still say (oh, there was a bunch of OTAFM activity, there's your problem!).

The workarounds make sense, but I hate this bug!!!

10-26-2006, 03:16 PM   #13
x14
BlackBerry Extraordinaire
Join Date: Jul 2005
Location: NYC
Model: 9800
OS: 6.0.0.546
Carrier: AT&T
Posts: 2,344

We do see a delay when the user's primary cluster server is down. We found that the BES has to fail on the primary server first before moving on to the other cluster node. At one point we had all our BB users on 1 server, and there was a lot of delay because the BES had to fail first before going to the working server for everyone.
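
For illustration only, the "fail on the primary first" behaviour and the missing "smarts" discussed in this thread can be sketched in Python. This is not BES code and does not correspond to any BES or Domino setting; the class name, the retry window, and the server names used with it are all hypothetical. The idea is simply to remember that the primary failed and skip it for a while, so the timeout is paid once instead of on every mail file check.

```python
import time

class ClusterPicker:
    """Remember a failed primary for a while instead of retrying it every time."""

    def __init__(self, primary, failover, retry_after_s=300.0, clock=time.monotonic):
        self.primary = primary
        self.failover = failover
        self.retry_after_s = retry_after_s  # hypothetical retry window
        self.clock = clock
        self.primary_down_since = None  # None means the primary is believed to be up

    def candidates(self):
        """Return the servers to try, in order."""
        down = self.primary_down_since
        if down is not None and self.clock() - down < self.retry_after_s:
            return [self.failover]            # skip the dead primary entirely
        return [self.primary, self.failover]  # normal order, primary first

    def open_mail_file(self, connect):
        """connect(server) returns a handle or raises ConnectionError."""
        for server in self.candidates():
            try:
                handle = connect(server)
            except ConnectionError:
                if server == self.primary:
                    self.primary_down_since = self.clock()  # start the skip window
                continue
            if server == self.primary:
                self.primary_down_since = None  # primary is back
            return handle
        raise ConnectionError("no cluster node reachable")
```

With a picker like this, only the first check after the outage (and one check per retry window) would wait on the dead primary. Listing the failover server in the person document, as suggested earlier in the thread, gets a similar effect by hand.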

10-26-2006, 03:38 PM   #14
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

OK, now that we have established that there is a delay, why is it there? I still have a BES 2.2 server, and when the primary mail server fails 2.2 will kick over to the cluster without any delay whatsoever, while the 4.0 server goes into delay mode. What changed in the 4.0 code to create this timeout issue every time the mailfile is polled?

10-27-2006, 03:46 PM   #15
amukhey
CrackBerry Addict
Join Date: Sep 2004
Location: Los Angeles
Model: 9700
Carrier: T-Mobile
Posts: 750

Well, I am doing a planned failover in 2 weeks. I can confirm the above soon after that. The planned outage is for 1000 users on 2 production BES servers running 4.0.5.13.

10-30-2006, 11:56 AM   #16
mahoward
CrackBerry Addict
Join Date: May 2005
Model: 8900
Carrier: T-Mobile
Posts: 560

Cool, please post your experience after your outage. Thanks!

11-02-2006, 03:50 PM   #17
kgaughan
New Member
Join Date: Jul 2006
Model: 8820
Carrier: ATT
Posts: 5

4.0 offers WIRELESS email reconciliation (read/unread, deleted, and filed messages); 2.2 does not. As a result, the BES state DBs now contain a read/unread marks table. Upon failover, the BES has to update this table for each user by scanning every user's mail file and updating the state DB.

Copyright © 2004-2016 BlackBerryForums.com.
The names RIM © and BlackBerry © are registered Trademarks of BlackBerry Inc.