These are the old pages from the weblog as they were published at Cornell.

October 15, 2003

HPTS Day 2: How are databases used at big customers?

This day was focused on 'user' experiences with large databases and services. The morning started with talks by people from Google, eBay and Amazon about their architectures. The Google talk by Anurag Acharya gave an overview of the overall system as we already know it, with some additional information about how they deal with reliability issues. For example, each data structure they use has a checksum associated with it to handle memory corruption, which is an issue they have to deal with. The Amazon talk by Charlie Bell was mainly about how they expose their store through different web services to partners and the general public. There were interesting numbers about update rates and message sizes between Amazon and partners such as Nordstrom.

The eBay talk by James Barrese was the most fascinating of the three, with lots of detail about the backend, the software, the development process, etc. Equally interesting were the afternoon talks by people from Schwab, NASDAQ and Merrill Lynch, which all gave a look into their backend systems.

I took a lot of notes on all of these talks, and have no time/desire to transcribe them into proper prose, so if you are interested in these raw notes, read on ...

James Barrese - eBay

  • Datacenters: 2 now, planned to grow to 5.
  • Update the codebase every 2 weeks. 35K lines of code change every week.
  • Rollback of a code update within 2 hours.
  • Scalability is all about the growth: 70% or more per year, every year. Hardware and software.
  • Database in 1999: a single-cluster database on a Sun A3500.
  • Too much load: moved to no triggers, started to partition by account, feedback, archive, and later categories.
  • Replication of user write and read, batch operations.
  • Current state: few triggers, denormalization of the schema
  • application enforced referential integrity, horizontal scale
  • application level sorting
  • compensating transactions,
  • application level join.
  • Database only used for durable storage, everything else in application code
  • eBay data access layer: partitioning, replication, control (logical hosts), graceful degradation,
  • NO 2PC. 'avoid it like the plague'.
  • index always available: if one category is out, still show the item, but unclickable.
  • online move of a table if it's getting hot.
  • Centralized application log service: audit, problem tracking.
  • Search: their own homegrown index with real-time updates (15 sec),
  • V2: IIS driven with C++, V3: J2EE container environment.
  • 4.5 M lines of code
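
The "application enforced referential integrity", "application level join" and "application level sorting" items above can be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not eBay's code: the in-memory shards, the modulo partitioning rule and every name in it (USER_SHARDS, ITEMS, shard_for, add_item, items_with_sellers) are hypothetical stand-ins.

```python
from operator import itemgetter

# Hypothetical stand-ins for partitioned stores; a real system would
# talk to separate database hosts through a data access layer.
USER_SHARDS = [
    {2: {"id": 2, "name": "bob"}},    # shard 0 holds even user ids
    {1: {"id": 1, "name": "alice"}},  # shard 1 holds odd user ids
]
ITEMS = [
    {"item": "lamp", "seller_id": 1, "price": 30},
    {"item": "desk", "seller_id": 2, "price": 120},
    {"item": "book", "seller_id": 1, "price": 8},
]


def shard_for(user_id):
    """Pick the user shard with a simple (assumed) modulo rule."""
    return USER_SHARDS[user_id % len(USER_SHARDS)]


def add_item(item, seller_id, price):
    """Insert an item, checking the seller exists first.

    The store has no foreign key, so the application does the check.
    """
    if seller_id not in shard_for(seller_id):
        raise ValueError(f"unknown seller {seller_id}")
    ITEMS.append({"item": item, "seller_id": seller_id, "price": price})


def items_with_sellers():
    """Join items to sellers and sort by price, all in application code."""
    joined = [
        {
            "item": row["item"],
            "price": row["price"],
            "seller": shard_for(row["seller_id"])[row["seller_id"]]["name"],
        }
        for row in ITEMS
    ]
    return sorted(joined, key=itemgetter("price"))


if __name__ == "__main__":
    for row in items_with_sellers():
        print(row)
```

The point of the sketch is only that the storage layer does nothing but hold rows; lookups across partitions, integrity checks and ordering all live in the application.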

Adam Richards - Schwab

  • 80% revenue goes through compute channels
  • no downtime allowed
  • real-time and stateful -> back-end becomes a single point of failure.
  • High peaks at market open & close -> tremendous over-provisioning; peak load is easily 10:1
  • Much of the data is not partitionable, multiple relationships between data elements
  • given the rate of change (1000/month) availability is 'pretty good'
  • Some technologies need to be supported because of acquisitions (archeology)
  • lack of "desire to retire" anything - investment in new stuff -> complexity goes up
  • new technologies: organic IT
  • Isolated islands of data and compute (often religious, sometimes technological)
  • linkage of islands through messaging
  • 60% of project spent on mitigating complexity
  • lower availability due to complexity
  • slower time to market
  • Design: SOA with application bus, request/reply, pubsub and streams
  • Nobody wants distributed 2PC, only when forced by legal reasons. 2PC is the anti-availability product
  • Q: clients want datagrade access, but services want interfaces
  • Q: how to decouple platforms 1) compute - asynchronous but no 2pc 2) data - propagation & caching but no 2PC
  • Q: how to act when the system is not perfect.
  • More and more unneeded information on the home page, causing peaks at logon (10's of transactions)

Ken Richmond - NASDAQ

  • High performance data feed
  • Top 5 levels of NASDAQ order book
  • near real-time, short development time
  • messages self-describing
  • Windows gateway, MSMQ 'channels' partitioned by the alphabet of the symbol
  • SQL Server clusters, passive-active - failover in 1 minute
  • SAN storage (EMC)
  • Multicast fan-out
  • pipeline architecture, everything should do 'one thing', serialization through single threading (regulation requires FIFO)
  • sending blocks of messages to SQL Server improves performance
  • biggest reason for app outage is overrun - flow control between the mainframe and Intel boxes
  • entire DB in memory all the time
  • interface in C++, logging events to Tibco Hawk
  • Test bed investment is crucial
  • Saturdays functional testing, Sunday performance testing
  • 10,800 msg/sec sustained in test, 30K database ops/sec in test; 1,800 msg/sec per channel in production, 7,800 msg/sec overall in production
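
To make the channel idea above concrete, here is a rough sketch of routing messages to channels by the first letter of the symbol, keeping FIFO order per channel, and draining them in blocks. It is an illustration under my own assumptions and not NASDAQ's design: the alphabet split in CHANNEL_RANGES, the in-memory deques standing in for MSMQ channels, and the function names are all hypothetical.

```python
from collections import deque

# Hypothetical split of the alphabet into four channels; the notes do
# not say where the real boundaries are.
CHANNEL_RANGES = ["ABCDEF", "GHIJKLM", "NOPQRS", "TUVWXYZ"]

# One FIFO queue per channel (stand-in for a message queue).
queues = [deque() for _ in CHANNEL_RANGES]


def channel_for(symbol):
    """Return the channel index for a ticker symbol by its first letter."""
    first = symbol[0].upper()
    for index, letters in enumerate(CHANNEL_RANGES):
        if first in letters:
            return index
    raise ValueError(f"no channel for symbol {symbol!r}")


def publish(symbol, payload):
    """Append a message to its channel, preserving arrival order."""
    queues[channel_for(symbol)].append((symbol, payload))


def drain(channel, batch_size):
    """Pop up to batch_size messages in FIFO order as one block.

    A single consumer per channel calling this keeps processing serialized,
    and the returned block could be written to the database in one call.
    """
    queue = queues[channel]
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch


if __name__ == "__main__":
    publish("MSFT", "bid 1")
    publish("INTC", "ask 2")
    publish("MSFT", "bid 3")
    print(drain(channel_for("MSFT"), batch_size=2))
```

Because each channel has exactly one consumer, ordering within a channel comes for free, at the cost of a single channel's throughput being bounded by one thread.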

Bill Thompson - Merrill Lynch

  • GLOSS - 2-tier architecture, Sybase/Solaris, PowerBuilder GUI
  • all business logic in stored procedures
  • uses tempdb as a key component
  • use: standing data maintenance, transaction management, interface to external clearing services (msgs), books & records
  • some of the topology: transaction manager front-office then to settlement engine and to the books & records, T+x updates from clearing house
  • 70K trades and settlements, 60K prices, 40K cash movements ... 250,000 transactions (= 100's of SQL statements each) / day
  • limits in 2001: transaction rates too high, overnight extracts ran over the window, DB 450 GB: unable to do maintenance tasks
  • Then moved to SQL Server
  • Challenges: migrate stored procedures, maintain the Unix code, migrate database in 48 hours
  • Unix - SQL Server connectivity: use the FreeTDS (open source) engine (Sybase ODBC and others didn't work or were a performance drag)
  • migrating the data was painfully slow using vendor tools. They wrote utilities themselves using the bcp functionality from FreeTDS
  • SQL Server 'bulk insert' rocks
  • performance: transaction throughput increased 30%, database extracts went from 235 minutes to 25 minutes
  • Future: 50-100% increase in transactions expected in 2004
  • SQL Server issues: stored procedure recompilation
  • Redesign: distributed transaction processing, consolidate stock & cash ledgers, data archival
  • re-architect the queue-based processing.
  • database now 1 TB; backup: full in 3-4 hours, incremental in 1 hour

David Patterson - Berkeley

  • Recovery Oriented Computing
  • margin of safety works for hardware, what does it mean for software?
  • tolerate human error in design, in construction and in use
  • RAID 5: operator removing the wrong disks, vibration/temperature failure before repair
  • never take software RAID
  • ROC philosophy: if a problem has no solution, it may not be a problem but a fact - not to be solved but to be coped with over time - Shimon Peres
  • that people, HW and SW fail is a fact
  • how can we improve repair?
  • MTTF is difficult to measure, MTTR is a better measurement
  • if MTTR is fast enough, it is not a failure
  • 'Crash only software' Armando Fox
  • Recognize, Rewind & Repair, Replay
  • Application proxy records input & output, feeds into an undo manager, combined with a no-overwrite store.
  • Lesson learned: takes a long time to notice the error, longer than to fix it
  • Operators are conservative, difficult to win trust
  • Configuration errors are a rich source of failures
  • Automation may make the ops job harder - reduces ops understanding of the system

Paul Maglio - IBM

  • Studies about system administrators
  • important users, expensive, understudied
  • interfaces for system administrators are similar to those for regular users, but this may not be right because admins are different
  • field studies, long term videos
  • Talks about how misunderstandings play a major role, and how admins control information while talking to others
  • All about saving face & establishing trust
  • Most of the time spent troubleshooting is spent on talking to other people (60%-90%)
  • 20% of the time is about who (else) to talk to (collaboration)
  • typical day: 21% planning, 25% meetings, 30% routine maintenance; half of all activities involved collaboration
  • DBAs always work in pairs or teams, meticulous rehearsal, attention to details, but easy to panic.

David Campbell - Microsoft

  • Supportability - Doing right in a Complex World
  • Emerging Technology: lots of knobs, high maintenance (testing, replacing, etc.). Everybody needs to be the expert
  • Evolving Technology: some knobs, automation, good enough for the masses - some local gurus
  • Refined Technology: no knobs, goal directed input - it just works - user base explodes - small expert base
  • Design targets: non-CS disciplines have design targets, CS often does not, and certainly not for supportability
  • Design for usability for common users: "virtual memory too low ..."
  • Limited design for test, no standard integrated online diagnostics
  • Techniques similar to those 20 years ago
  • Errors: don't be lazy, make explicit what you want the user to do
  • Never throw away an error context
  • 'Principle of least astonishment':
  • do reasonable user actions yield reasonable results? Make it hard for people to hose themselves
  • With respect to reliability, how much of the complexity is spent on actually making the system more reliable?
  • In automobiles today faults are recorded to help the mechanics.
  • Customers are willing to invest up to 20% to get to 'it just works'
  • in SQL Server, introduced more closed feedback loops to reduce operator involvement
  • Automated, adaptive control based on 'expert systems knowledge'
  • results were surprising: better than the best expert could do.
  • example: memory allocation, statistics collection, read-ahead, dynamically allocated system structures
  • Memory: replacement cost for each page, utilities 'bid' on memory, feedback on cost of allocation.
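
The two error-handling points above ("make explicit what you want the user to do" and "never throw away an error context") could look something like this in Python, using exception chaining. It is just a small sketch of the general idea and has nothing to do with SQL Server internals; ConfigError and load_port are made-up names.

```python
class ConfigError(Exception):
    """Hypothetical application-level error shown to the operator."""


def load_port(settings):
    """Read the 'port' setting, telling the user what to do if it is bad."""
    try:
        return int(settings["port"])
    except (KeyError, ValueError) as exc:
        # 'from exc' keeps the original exception attached as __cause__,
        # so the low-level context is not thrown away.
        raise ConfigError(
            "set 'port' to a whole number in the settings, e.g. port=8080"
        ) from exc


if __name__ == "__main__":
    try:
        load_port({"port": "eighty"})
    except ConfigError as err:
        print(f"error: {err}")
        print(f"caused by: {err.__cause__!r}")
```

The message says what the user should change, while the chained cause keeps the original failure available for whoever has to debug it.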
Posted by Werner Vogels at October 15, 2003 01:18 PM
