?

Log in

No account? Create an account
entries friends calendar profile My Website Previous Previous Next Next
On storing phone numbers in databases - Mark Atwood
fallenpegasus
fallenpegasus
On storing phone numbers in databases
The client du jour has a table with a column that keeps phone numbers. This is normal.

However, the column is a varchar(255) charset UTF8.

This is not okay.

In MySQL, that will take up 765 bytes per row in memory, plus will be doing UTF8 to to 3byte expansion for every row read or scanned. And it has to use the wchar string ops to do comparisons.

The ISO standard for PSTN phone numbers, E.164, states that phone numbers are a max of 15 digits long, and that includes the country code.

Make phone numbers a varchar(16) charset ASCII. Have your application mash out all the spaces and hypens.

Then marshalling and expansion is free, and modern processors can do an equality test against 16 bytes in a single operation.

16 bytes vs 765, that's a 50x savings.

Oh, and stop assuming that all phone numbers are in the US and Canada, and thus are of the form (AAA) EEE-TTTT. There are acceptably good library routines in your language of choice that will accept a string of digits, look at the country code, know that country's dialing plan, and output a "pretty formated" phone number. Don't try to keep pretty formated phone numbers in your database, and don't try to make the database server do the pretty formatting.

Tags:

21 comments or Leave a comment
Comments
dossy From: dossy Date: October 1st, 2008 03:25 pm (UTC) (Link)
Uh, exactly why would you store a number as an ASCII string? 48 bits (6 bytes) are sufficient to represent any 15 digit number.

6 bytes vs. 16, it's only a 2.6x savings, but 6 bytes vs. 765 is a 127x savings! (Sounds more impressive that way, heh.)

vatine From: vatine Date: October 1st, 2008 03:32 pm (UTC) (Link)
Telephone numbers aren't "numbers", really. So storing them as huge integers is fraught with curious challenges (not, mind you, that anyone seems to use the A, B, C and D "numbers" for anything).

Also, can we have a rant about storing names in databases? Including a rant about "middle names" and how these assumptions FAIL when you're operating outside British and American (to some extent) cultural constraints.
fallenpegasus From: fallenpegasus Date: October 1st, 2008 03:55 pm (UTC) (Link)
That's a complex issue. And one for another post.

And much of the problem is, unlike for telephone numbers, there are not useful libraries in language-of-choice that can be handed a UTF8 string, and hand back the tagged name-parts.
vatine From: vatine Date: October 1st, 2008 04:13 pm (UTC) (Link)
It is a very complex issue. I know of one method that (almost) handles (several) "Western" conventions, as long as you don't need to separate out the surname (but can, concievably, be extended to handle that). Thinking about it, it MAY even work for other name-orders too.
akicif From: akicif Date: October 1st, 2008 05:08 pm (UTC) (Link)
Tell me about it. Having to get contact info from a spreadsheet - not even a database - where some eejit had stored phone numbers as numbers: bye-bye leading zeroes, and in some cases numbers with dashes in had displayed as the results of the appropriate subtraction!
wyckhurst From: wyckhurst Date: October 1st, 2008 03:39 pm (UTC) (Link)
I have no idea what you are talking about. ha ha ha!
fallenpegasus From: fallenpegasus Date: October 1st, 2008 03:57 pm (UTC) (Link)
It's like using just one little line in an address book to keep a phone number, or else using an entire page in the address book, and insisting that your handwriting is an imporant part of the phone number.
wyckhurst From: wyckhurst Date: October 1st, 2008 04:05 pm (UTC) (Link)
ah, gotcha! :D
supersnazz From: supersnazz Date: October 1st, 2008 04:23 pm (UTC) (Link)
The main trouble I see here comes in when you start looking for work phone numbers (or, in my experience, students' university dorm phone numbers): more and more people are including personal extensions to a phone number. You can bump your phone number field up to 20 characters (I've not seen an extension longer than five digits). Unfortunately, once you strip out non-numerics, it's impossible to tell an NANP local number with a three digit extension from an extensionless number with an area code. Short of having a frequently updated lookup table of NANP area codes, there isn't a reliable means of telling whether a three digit combo is an area code or an exchange code.
displague From: displague Date: October 1st, 2008 04:44 pm (UTC) (Link)
This was going to be my point ;-)

You can't store "1-555-555-5555,1#12423" as a numeric. You could add a "phoneext" field, but what's the point? Who does queries based on phone number anyway? Who indexes it? Who does math against phone numbers? Maybe a few autodialers.

It's user generated data that you can't format without causing a detrimental experience to some users. 50% of the time, the number is going to be validated by the user when he sees it displays and confirms that it is his number, the other 50% will be validated by a human trying to reach that user/customer and those extra characters (or notes (eg. "555-555-5555 8a-5p")) may be meaningful.
fallenpegasus From: fallenpegasus Date: October 1st, 2008 05:50 pm (UTC) (Link)
"Who does queries based on phone number anyway? Who indexes it?"

It happens a fair amount.

In this particular case, the application is a POS terminal, so looking up a person's account at that store by their phone number is a reasonably common thing.
vatine From: vatine Date: October 1st, 2008 07:41 pm (UTC) (Link)
Who does queries based on phone number anyway?

Not maths, per se, but at least index checks. Last place I worked, we had one application that regularly dissected numbers, did range checks and all sorts of nasty to them (because when the app is the back-end to a multi-service telephony platform, you sometimes need to). That stored phone numbers as a varchar(20) in MySQL (well, who knows, some psycho international phone conspiracy may add a couple of digits next century; no I didn't write it, I only made sure it was happy).
fallenpegasus From: fallenpegasus Date: October 1st, 2008 08:12 pm (UTC) (Link)
Cool.

Did it store international phone numbers, or NANP only.

NANP only applications are getting more and more annoying each passing day. As business and life has become more fully internationalized, and international call prices have crashed to near zero, and even international cell roaming has become affordable...

Some of the people I work with have EU numbered cellphones they carry everywhere with them, including in the US. It's an "international call" to call them, even when they are in the same city.

I'm starting to get annoyed with even having to do the "international call 011" on my own phone. Just let me dial the CC without telling the switch "yes really, i want to make an international call".
vatine From: vatine Date: October 1st, 2008 10:09 pm (UTC) (Link)
The one that was in-house did "national E.164", the other one, that didn't do nearly as many features, but (on the whole) was a better phone switch, used full E.164. None of them did NANP in our implementation (the in-house only needed to know the UK dial plan, the commercial could, theoretically, do NANP, but we didn't load the NANP rule-set, I did a ruleset for Sweden, though, as an exercise in dial-plan writing).
tcepsa From: tcepsa Date: October 1st, 2008 05:17 pm (UTC) (Link)
What about using a separate field for extensions? varchar(5), ASCII encoding. You might not be able to use language-specific methods to help manage it, but then you couldn't have in the first place so that's no real loss...

Or does this cause a significant amount of overhead by having another column in the database?
tcepsa From: tcepsa Date: October 1st, 2008 05:22 pm (UTC) (Link)
I love posts like this. A good practice, with solid reasoning and concrete numbers behind why it is a good practice. (I'm grumbly about the amount of knowledge--especially when it comes to databases--that seems to propagate itself because "an expert said so" without having any additional specifics to back it up. For example, I only finally understood a little bit about how indices work when I took my data structures class last semester; until then they were just some mystical thing that made databases faster unless they were implemented poorly and just made everything worse).

So things like this, that give details about the internal gotchas, are greatly appreciated!
awfief From: awfief Date: October 1st, 2008 10:02 pm (UTC) (Link)
I think you should expound more upon why the field should be varchar(16) instead of varchar(255) -- you explain why utf8 vs. ascii well. "But," I can hear folks asking, "if it's variable length anyway, what's the problem with having varchar(255)? If I'm worried about more than 16 digits, can I use a varchar(20) or a varchar(50) or varchar(99) or varchar(100)? What's the difference?"
awfief From: awfief Date: October 1st, 2008 10:05 pm (UTC) (Link)
(you say that it will take up to 765 bytes per row, but you neglected to mention how many bytes if there are just 16 characters...and here I'd assume ASCII, because what I'd like is for you to explain why you should have a varchar(16) instead of varchar(255). )
fallenpegasus From: fallenpegasus Date: October 2nd, 2008 01:43 am (UTC) (Link)
It will take 765 bytes per row in memory, because MySQL expand a UTF8 character to a 24 bit representation internally (which I think is a mistake, since that still freezes out the high code pages that make Unicode so interesting.)

VARCHAR(16) in ASCII encoding will take 16 byes, because an ASCII character is stored in a single byte.

It will take the full 765 or 16 bytes, no matter how long the actual string is, because MySQL expands varchars to their full width in memory.


All this has to do with the row representation in memory. The storage engines, such as MyISAM and InnoDB, can and do store UTF8 as, well, UTF8, eg, variable length encoding from one to six bytes per character, and only store the actual number of characters in the string.

Edited at 2008-10-02 01:44 am (UTC)
arjen_lentz From: arjen_lentz Date: October 1st, 2008 11:26 pm (UTC) (Link)

not just 16

There's phone extensions as well, not just in companies with PABXs but also in Germany where ISDN is rather prolific. But yes, certain a more sane number than 255 is quite possible, and utf8 not needed.
I've even suggested using an INT|BIGINT UNSIGNED, however if people want to store alphanumeric numbers (1-1800-something) that does need to be taken into account.

From: wnoise Date: October 2nd, 2008 12:01 pm (UTC) (Link)

Re: not just 16

There's one very small issue with storing only the digits -- numbering systems and formatting change. Germany, for instance, has a variable-length prefix coding (bigger cities get smaller prefixes). This mapping does get updated on occasion. I can easily see a number becoming outdated, and new prefixes put in that end up splitting it incorrectly as a call to a different region than it was, or even choking on it. Is it an issue in practice? Probably not. Telephone networks are pretty good about making changes in nicely backwards-compatible ways though, and relatively long overlap periods. And the ways of formatting numbers in Germany isn't nearly as standardized as in NANP.

Of course, I'm not sure there is an entirely right thing to do in cases like this, but formatting information isn't /entirely/ redundant.

But yes, 255 is way too big, and UTF-8 encoding what fits nicely in ASCII is silly.
21 comments or Leave a comment