September 17, 2010

Details of the JPMorgan Chase Oracle database outage

After posting my speculation about the JPMorgan Chase database outage, I was contacted by – well, by somebody who wants to be referred to as “a credible source close to the situation.” We chatted for a long time; I think it is very likely that this person is indeed what s/he claims to be; and I am honoring his/her requests to obfuscate many identifying details. However, I need a shorter phrase than “a credible source close to the situation,” so I’ll just refer to him/her as “Tippy.”

According to Tippy,

Tippy stressed the opinion that the Oracle outage was not the fault of JPMorgan Chase (the Wednesday slowdown is a different matter), and rather can be blamed on an Oracle bug. However, Tippy was not able to immediately give me details as to root cause, or for that matter which version of Oracle JPMorgan Chase was using. Sources for that or other specific information would be much appreciated, as would general confirmation/disconfirmation of anything in this post.

Metrics and other details supplied by Tippy include:

One point that jumps out at me is this – not everything in that user profile database needed to be added via ACID transactions. The vast majority of updates are surely web-usage-log kinds of things that could be lost without impinging on the integrity of JPMorgan Chase’s financial dealings, not too different from what big web companies use NoSQL (or sharded MySQL) systems for. Yes, some of it is orders for the scheduling of payments and so on – but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.

Related link

Comments

36 Responses to “Details of the JPMorgan Chase Oracle database outage”

  1. Theories and soundbites about the Chase database outage | DBMS 2 : DataBase Management System Services on September 17th, 2010 7:44 am

    [...] Edit: Subsequent to making this post, I obtained more detail about the JP Morgan Chase database outage. [...]

  2. Chris on September 17th, 2010 4:10 pm

    “The vast majority of updates are surely web-usage-log kinds of things that could be lost without impinging the integrity of JPMorgan Chase’s financial dealings”

    I disagree there. Even the less critical data can’t be lost in this sort of company, purely for auditing reasons.

    They need to know there was a failed/successful login from ip x.x.x.x at x:09 PM. Is it as important as an actual transaction? No, but they need the data nonetheless.

    Do they need to know that a browser claiming to be Firefox loaded the main page, then left? Probably not, but separating out a lot of that data is what leads to over-engineering.

  3. Curt Monash on September 17th, 2010 6:51 pm

    Chris,

    Actually, I have a hunch that the same project queued up to replace the Oracle database will in fact break it back up by application, simplifying it somewhat. Does that mean any part can go non-transactional? Not necessarily.

    But I’m not sure that they need an ACID-compliant security audit trail even on authentication attempts, which seems to be what you were suggesting.

    In fact, I’d go so far as to suggest that even at banks, national security departments, etc., there actually are lots and lots of perimeter security devices that don’t keep all authentication request data under the control of an ACID-compliant DBMS.

  4. John Bender on September 18th, 2010 3:15 am

    “Yes, some of it is orders for the scheduling of payments and so on – but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.”

    Under the assumption that you have not worked on or audited the application, how can you possibly speculate that the system was over-engineered?

  5. Curt Monash on September 18th, 2010 5:13 am

    The same way I comment on other technology I didn’t write personally, which includes almost everything I talk about on this blog.

  6. RJP on September 18th, 2010 4:37 pm

    “Before long, JPMorgan Chase DBAs realized that the Oracle database was corrupted in about 4 files, and the corruption was mirrored on the hot backup. Hence the manual database restore starting early Tuesday morning.”

    Having had some past exposure to the Chase environment, I can say this is not the first time an issue of db corruption has caused an extended outage.

    Reading your synopsis, yes, I have to agree with the over-engineering comment, but I would extend this across the whole architecture, which has led to many moving parts – transactions, messages, and points of failure – that require constant attention.

    What is worth questioning is how well-balanced JPC’s high availability strategy is. From what has been described, this is common with physical replication tools, and I have seen many occurrences where this type of replication cascades corrupted files to the secondary/backup nodes. Reading the technologies involved, and knowing that JPC is somewhere between Oracle 9i and 10g, an assumption is being made that it’s using physical Data Guard, and possibly SRDF (leading to some finger pointing at EMC).

  7. Curt Monash on September 18th, 2010 5:24 pm

    Thanks, Ryan!

  8. Oracle database crashes JP Morgan Chase web site « Data In Action on September 19th, 2010 12:58 am

    [...] insight into technology and marketplace trends.”, has some very interesting details on his blog, including this: “…even before all this started JPMorgan Chase had an open project to [...]

  9. Jerry Leichter on September 19th, 2010 7:22 am

    …which leads to the interesting question: Does anyone know of a setup in which one replicates *with a delay*? That is, there’s a master and a replica to which updates are always applied with a 15 minute delay. If the master blows up, you stop the updates to the replica until you’re sure you can make them safely, then continue. (Obviously, 15 minutes is an arbitrary value – you trade off the delay in restarting against the window you have in which to detect a problem. Given enough funds, you could have multiple replicas, but even someone like JPMorgan Chase would have trouble justifying that.)

    In effect, “hot backup” rather than “hot standby”.

    — Jerry
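    [Editor’s note: Jerry’s delayed-replica idea can be sketched in a few lines. This is a toy illustration under assumed semantics, not any particular vendor’s feature; all names here are made up.]

    ```python
    import collections

    class DelayedReplica:
        """Toy model of a replica that applies updates only after a fixed delay.

        Updates younger than `delay` sit in a holding queue. If corruption is
        detected on the master before they age out, halt() stops apply,
        preserving the pre-corruption state plus the unapplied queue.
        """

        def __init__(self, delay):
            self.delay = delay                 # delay in time units
            self.queue = collections.deque()   # (timestamp, key, value) triples
            self.state = {}                    # the replica's applied state
            self.halted = False

        def receive(self, timestamp, key, value):
            # Updates arrive in real time but are not applied yet.
            self.queue.append((timestamp, key, value))

        def tick(self, now):
            # Apply every queued update that is at least `delay` old.
            while (not self.halted and self.queue
                   and now - self.queue[0][0] >= self.delay):
                _, key, value = self.queue.popleft()
                self.state[key] = value

        def halt(self):
            # Called when the master is found to be corrupt: stop applying.
            self.halted = True
    ```

    Here the trade-off Jerry describes is explicit: a larger `delay` gives a longer window to notice corruption before it is applied, at the cost of a staler replica.
    
    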

  10. DH on September 19th, 2010 7:23 am

    RJP – “What is worth questioning is how well-balanced JPC’s high availability strategy is. From what has been described, this is common with physical replication tools, and I have seen many occurrences where this type of replication cascades corrupted files to the secondary/backup nodes. Reading the technologies involved, and knowing that JPC is somewhere between Oracle 9i and 10g, an assumption is being made that it’s using physical Data Guard, and possibly SRDF (leading to some finger pointing at EMC).”

    Okay, so why point at EMC? SRDF maintains consistency – and quickly at that. If the DBMS is corrupted, this is obviously replicated to the failover site (active/passive mirror); the question should come down to whether SRDF alone is a suitable vehicle for near-real-time recovery from application error. Ryan, care to elaborate on the finger pointing?

  11. Sunny Nair on September 19th, 2010 12:30 pm

    I am going to abstain from the finger-pointing dialogue, but I want to mention that while SRDF does its job well, it is important to assess the implementation and the what-ifs. There are solutions available to handle the delay in replicating the data to the BCP site, etc.; EMC’s RecoverPoint is one such solution. As always, not everything is a fit for all, and a solution needs to be designed, tested, and validated based on the needs.

  12. Oracle DBa on September 19th, 2010 12:53 pm

    They could have prevented this if they had a Data Guard standby with Flashback, or a logical replica via GoldenGate. But oh well … as my friend says: my CIO has a contingency plan in case of disaster: find a new job. And disasters/bugs don’t happen too often.

  13. Details of the JPMorgan Chase Oracle database outage (Curt Monash/DBMS 2) on September 19th, 2010 2:18 pm

    [...] of the JPMorgan Chase Oracle database outage (Curt Monash/DBMS 2) Curt Monash / DBMS 2:Details of the JPMorgan Chase Oracle database outage  —  After posting my speculation about the JPMorgan Chase database outage, I was [...]

  14. Curt Monash on September 19th, 2010 3:18 pm

    I reported that some of the early analysis of the problem examined what turned out to be an INCORRECT theory that EMC was to blame. So it was wholly appropriate for Ryan to speculate on why anybody might have ever held that theory in the first place.

  15. Elmari Swart on September 19th, 2010 9:50 pm

    Oracle is a nightmare to patch and is increasingly vulnerable – NITS

    http://www.computerworld.com/s/article/9057226/Update_Two_thirds_of_Oracle_DBAs_don_t_apply_security_patches

  16. calvin on September 19th, 2010 11:52 pm

    This is good reporting, but I take issue with the opinion making: “but on the whole, the database was probably over-engineered, introducing unnecessary brittleness to the overall system.”

    Over-engineering a database does not cause file corruption. File corruption is typically caused by a disk error, hence the first assumption by the JPMorgan Chase group. The corruption appears (from this post) to be caused by a bug in the Oracle DBMS itself. Over-engineering a database is unlikely to lead to the exposure of a software bug. Under-engineering or ignorant misuse of db options seems more likely to expose bugs, and that seems unlikely given JP Morgan Chase’s buying/hiring power.

    Over-engineering a DB can lead to a kind of brittleness. But it is misleading to suggest that kind of brittleness was a contributing factor to a bug-failure in Oracle. (and I’m no fan of Oracle, I think that’s over-engineering right there!)

    JPMC probably keeps these files/tables along with all their other files in the same DBMS; if they are not, then brittleness can arise from applications that interact across DBMSs and other repositories. And it’s those applications that give rise to brittleness, and even then they are very unlikely to produce a bug-driven Oracle DBMS failure.

  17. Report: Chase Snafu Slowed $132M in Transfers« Data Center Knowledge | Science on September 20th, 2010 3:26 am

    [...] been no official incident report offering details of the outage, but database industry analystCurt Monashhas an interesting unofficial account. After writing about the incident last week, Monash was [...]

  18. Non-Oracle DBA on September 20th, 2010 10:36 am

    I believe several folks have misunderstood Curt’s assessment of brittleness. I take it he was referring to how this over-engineering, rather than being a contributing *cause* of the outage, was likely a major factor in the difficulty of *ending* the outage quickly – and also in the outage not being limited to the authentication portion of the app, but ultimately becoming a lengthy back-end outage that prevented processing of non-online transactions (which I believe was established in Curt’s prior post).

    Then again, if JPMC split this up into a MySQL authentication DB + Oracle user profile DB scenario, and the Oracle back end had a corruption, it sure seems like you have most of the same issues. The only advantage you get is that customers can authenticate into a system that has an outage in a different DBMS downstream.

    Some advantage, eh?

  19. RJP on September 20th, 2010 1:58 pm

    DH, no finger pointing, nor am I raising any flags about the reliability of SRDF. Rather, as Curt pointed out, I was commenting on the possible reason why SRDF was suspected. As with many DR failures, doubt is cast upon every tool in the recovery process until it is ruled out.

    As to Curt’s suggestion that over-engineering was a factor in the recovery process – that is, in resolving the matter and maintaining a low-MTTR strategy: having witnessed some architectures that fell into that category, the brittleness didn’t contribute to the failure, but it limited the options for recovery to the point where radical changes had to be implemented.

  20. Curt Monash on September 20th, 2010 2:50 pm

    The other brittleness point is that we still don’t know what was corrupted. Was it authentication? Web log type of profile data? ACH instructions? If it really was authentication, that’s a good excuse for many apps to go down at once. If it was something else, perhaps more apps came down than really “needed” to.

  21. nm on September 20th, 2010 5:38 pm

    If they had Data Guard configured and corruption was detected on the primary, the MRP process on the standby should have prevented it from being applied. They could have restored a copy of those 4 files from the standby and recovered them, which should have led to a smaller outage window. Taking this a step further, if they were/are on 11g, there is a new feature that does automatic block recovery from primary to standby and vice versa. The fact that corruption appears to have been replicated to the DR site leads me to believe they were probably using only a storage replication technology, which would replicate corruption too … well, that’s just my opinion.
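    [Editor’s note: for what it’s worth, Data Guard’s redo transport does support a delayed-apply option of the sort Jerry asked about, via the `DELAY` attribute (in minutes) on an archive destination. A hedged sketch – the destination number and service name here are illustrative, not JPMC’s actual configuration:]

    ```sql
    -- Ship redo to the standby but hold off applying it for 15 minutes,
    -- giving a window to halt apply before corruption propagates.
    -- (DELAY is ignored if the standby uses real-time apply.)
    ALTER SYSTEM SET log_archive_dest_2 =
      'SERVICE=standby_db DELAY=15 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)';
    ```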

  22. Chase-Sucks.org » More details on Chase’s website crash emerge on September 20th, 2010 7:20 pm

    [...] subsequently posted that the outage was caused by corruption in an Oracle database which stored user profiles. Four [...]

  23. Mike P on September 20th, 2010 10:44 pm

    I think that, generally speaking, array-based replication (or any replication) is used for disaster recovery, and so should happen in as near real time as possible. Technologies such as point-in-time copies, or ‘snapshots’, are more usually used for recovery from corruption. The thing about corruption in a database is that you can never tell when it started, so how do you know how long to delay the application of changes to the remote copy? JPMC may well have used a snapshot to recover, though, and then performed database recovery. That they lost data says something about the design, though.

  24. How to tell whether you need ACID-compliant transaction integrity | DBMS 2 : DataBase Management System Services on September 21st, 2010 12:30 am

    [...] a post about the recent JPMorgan Chase database outage, I suggested that JPMorgan Chase’s user profile database was over-engineered, in that various [...]

  25. M-A-O-L » When do you need ACID? on September 23rd, 2010 12:51 am

    [...] recent JPMorgan Chase outage caused by an Oracle RAC block corruption places an old question back on the agenda that gets ignored way too often: How to tell whether you [...]

  26. A little more on the JPMorgan Chase Oracle outage | DBMS 2 : DataBase Management System Services on September 24th, 2010 7:38 pm

    [...] Vijayan of Computerworld did a story based on my reporting on the JP Morgan Chase Oracle outage. He did a good job, getting me to simplify some of what I said before. He also added a quote from [...]

  27. NoSQL Daily – Sun Sep 26 › PHP App Engine on September 26th, 2010 4:15 am

    [...] Details of the JPMorgan Chase Oracle database outage | DBMS2 : DataBase Management System Services [...]

  28. How to preserve investigative reporting in the New Media Era | Text Technologies on September 26th, 2010 8:18 am

    [...] An anonymous tipster spent 2 ½ hours IMing with me to reveal the true cause of the JP Morgan Chase site outages. [...]

  29. On the JPMC outage « I, Geek on September 27th, 2010 4:32 am

    [...] 2010 at 8:32 am (data, design, engineering, musings) The blogosphere is abuzz about JPMC outage (1, 2, 3). The basic reason people site for long recovery time is a big, ambitious database design [...]

  30. Further thoughts on previous posts | DBMS 2 : DataBase Management System Services on September 27th, 2010 7:29 am

    [...] Meanwhile, RJP supplied details about the JP Morgan Chase Oracle outage that my actual source didn’t know. [...]

  31. links for 2010-09-28 | Bare Identity on September 28th, 2010 8:01 pm

    [...] Details of the JPMorgan Chase Oracle database outage | DBMS 2 : DataBase Management System Services (tags: database ha oracle jpmorgan outage) [...]

  32. [Translated] The JPMorgan Chase Oracle database meltdown - つやてざニュース on November 17th, 2010 2:02 am

    [...] Source: [DBMS2] photo credit: peterkaminski Admin comment: It is surprisingly common for a relatively unimportant part of the system, like logging, to drag the whole system down. If that is what happened here, it was truly a cart-before-the-horse crash. [...]

  34. Anonymouse on November 22nd, 2010 12:33 pm

    > Jerry Leichter Said:
    > Does anyone know of a setup in which one
    > replicates *with a delay*?

    MongoDB has a configurable slave delay.
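    [Editor’s note: a hedged sketch of that option as configured in the mongo shell of this era, via the `slaveDelay` field of the replica set config; host ordering and member index here are illustrative:]

    ```javascript
    // Make member 2 a delayed secondary, lagging 15 minutes behind the primary.
    cfg = rs.conf();
    cfg.members[2].priority = 0;      // a delayed member must not become primary
    cfg.members[2].hidden = true;     // and is usually hidden from clients
    cfg.members[2].slaveDelay = 900;  // apply ops 900 seconds late
    rs.reconfig(cfg);
    ```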

  35. How information technology is really built on April 6th, 2011 7:10 am

    [...] [...]

  36. Live Databases – “the stuff of nightmares” | DataStax on October 24th, 2011 12:59 am

    [...] a third choice, and it’s often misunderstood.  Let’s start by looking at a blog post on the JPMorgan Chase incident.  The author makes the following observation, with which I agree: [...]
