MySQL and Friendly URLs

enouwee

Non ex transverso sed deorsum
Reaction score
240
Let's do another little experiment.

I start out with my previous setup I've been using for 1 million entries.

1) The "words" consists of:

Code:
mysql> desc test_words;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(32)      | NO   | PRI | NULL    |       |
| word  | varchar(255) | NO   | UNI | NULL    |       |
+-------+--------------+------+-----+---------+-------+

mysql> select count(1) from test_words;
+----------+
| count(1) |
+----------+
|     9993 |
+----------+

mysql> select avg(length(word)) from test_words;
+-------------------+
| avg(length(word)) |
+-------------------+
|            6.3832 |
+-------------------+
1 row in set (0.01 sec)

2) Now I create two additional tables, one being fixed-width, the other varchar-based:

Code:
create table test_words_char (
  id integer(32) primary key auto_increment,
  word char(9) not null,
  constraint unique test_words_char_u1(word)
);

create index test_words_char_word on test_words_char(word);

create table test_words_varchar (
  id integer(32) primary key auto_increment,
  word varchar(9) not null,
  constraint unique test_words_varchar_u1(word)
);

create index test_words_varchar_word on test_words_varchar(word);

3) On disk, we've got:

Code:
12      test_words_char.frm
0       test_words_char.MYD
4       test_words_char.MYI

12      test_words.frm
196     test_words.MYD
324     test_words.MYI

12      test_words_varchar.frm
0       test_words_varchar.MYD
4       test_words_varchar.MYI


4) Let's add some data:
Note that I've had to filter out some rows because of duplicate values:

Code:
mysql> insert into test_words_char (word) select distinct left(word, 9) from test_words;
Query OK, 9777 rows affected (0.14 sec)
Records: 9777  Duplicates: 0  Warnings: 0

mysql> insert into test_words_varchar (word) select distinct left(word, 9) from test_words;
Query OK, 9777 rows affected (0.13 sec)
Records: 9777  Duplicates: 0  Warnings: 0

Thankfully the length is the same:

Code:
mysql> select avg(length(word)) from test_words_char;
+-------------------+
| avg(length(word)) |
+-------------------+
|            6.1663 |
+-------------------+
1 row in set (0.00 sec)

mysql> select avg(length(word)) from test_words_varchar;
+-------------------+
| avg(length(word)) |
+-------------------+
|            6.1663 |
+-------------------+
1 row in set (0.01 sec)

And so are the words inside:
Code:
mysql> select count(1) from test_words_char c where not exists (select 1 from test_words_varchar v where v.word = c.word);
+----------+
| count(1) |
+----------+
|        0 |
+----------+
1 row in set (0.09 sec)

5) How much disk space do the various tables use?

Code:
12      test_words_char.frm
320     test_words_char.MYD
256     test_words_char.MYI
[B]588     tota[/B]l

12      test_words_varchar.frm
192     test_words_varchar.MYD
256     test_words_varchar.MYI
[B]460     tota[/B]l

12      test_words.frm
196     test_words.MYD
324     test_words.MYI
[B]532     total[/B]

It shouldn't come as a surprise that the varchar(9) table uses the least space. I hope you're not too surprised that the constant-length table wastes quite a lot of bytes because the most common words are shorter than 9 characters. Nevertheless, it's funny to note that the original table with all and untruncated rows occupies less that the constant-length table.

What does this tell you? You should only use char(...) when your data is fixed length, like 14-digit coded dates: 20090120123000.

But I'm sure you'll tell us why the fixed-length strings will perfectly make sense in your case. After all, aren't you aiming at saving maybe 50-100KB on the words table when the additional join indexes you need already take up more than 10MB?
 

JerseyFoo

1/g = g-1
Reaction score
40
Static tables are supposedly stabler and faster, perhaps not for string searches. Either way I'd like to try it.

Nevertheless, it's funny to note that the original table with all and untruncated rows occupies less that the constant-length table.
Not if your talking index.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
Static tables are supposedly stabler and faster, perhaps not for string searches. Either way I'd like to try it.
I knew you'd find something. :rolleyes:

Not if your talking index.
I'm referring to the whole space required of course, but mostly to the tables, as they're of different structure. The indexes are exactly the same for both tables and thus require the same space per row, whether your columns are char or varchar.
 

JerseyFoo

1/g = g-1
Reaction score
40
I need an algorithm to most fastly and effectively figure out uniqueness of English titles in URI format

"This Is Most Common Title" = "this-is-most-common-title" ~ 1
"This Is 2nd Most Common Title" = "this-is-2nd-most-common-title" ~ 2

...would be maximum efficiency, however it needs to be done nearly instantly upon request, so has to be rough.

Example: By first char of each word

"this-is-most-common-title" ~ timct
"this-is-2nd-most-common-title" ~ ti2mct

Can start there :D
 

Samuraid

Advisor
Reaction score
81
If you have a base string and you need to figure out the distance to other strings, you might try something like this:
http://php.net/levenshtein

Otherwise, the "best" algorithm you use completely depends on the data patterns you expect in the specific URLs on your site.

If you are already calculating the formatted URL form (i.e. "this-is-most-common-title"), you might as well just compare the entire formatted URL string with others as it takes roughly linear time, and your algorithm to get the first characters of each word also takes roughly linear time, but doesn't run in place (it creates extra strings) and requires another comparison on top of it to compare the smaller strings it generates.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
If you have a base string and you need to figure out the distance to other strings, you might try something like this:
http://php.net/levenshtein
Unfortunately, the only string similarity operator available in MySQL is soundex(), which should be a must in this case ([THREAD=116981]see here[/THREAD] for the first part of the story). It'd be easy with Oracle as backend, as it provides (double) metaphone as standard function.

You may implement the Levenshtein distance as stored procedure, but you'll still be dependent on other criteria (range scans) for good performance. Even with MySQL, soundex could at least be used as indexed column.
 

JerseyFoo

1/g = g-1
Reaction score
40
SoundEx has actually come a 2nd in efficiency ( by length, anyway ) and a dead last in accuracy.
Avg length: 4
% Conflict (adj): 0.832%
% Conflict: 0.684%
Time: 6751us
(adj) weights unique conflicts a bit heavier)

Metaphone
Avg length: 9.22
% Conflict (adj): 0.501%
% Conflict: 0.351%
Time: 5571us

"First Character Only" does surprisingly well...
Avg length: 3.92
% Conflict (adj): 0.553%
% Conflict: 0.39%
Time: 8744us

The most accurate so far has been "No Vowel"
Avg length: 12.24
% Conflict (adj): 0.166%
% Conflict: 0.101%
Time: 12841us

It should be noted that with most conflicts the difference is a version number (ought to try it with normal english data as well). Considering adding a "number" parameter.

I doubt the thread's host is this "bored" though.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
SoundEx has actually come a 2nd in efficiency and a dead last in accuracy.
Think before using a function. Judging by the average length of "4", you applied the soundex() transformation to single words only (it's fixed-width per word). This is the wrong way to use phonetic algorithms.
 

JerseyFoo

1/g = g-1
Reaction score
40
I've been using it on titles. Tried separated by '-', ' ', and no separation. If it needs to be done per word that will just skyrocket the inefficiency.
 
General chit-chat
Help Users
  • No one is chatting at the moment.

      The Helper Discord

      Staff online

      • Ghan
        Administrator - Servers are fun

      Members online

      Affiliates

      Hive Workshop NUON Dome World Editor Tutorials

      Network Sponsors

      Apex Steel Pipe - Buys and sells Steel Pipe.
      Top