MySQL and Friendly URLs

enouwee

Non ex transverso sed deorsum
Reaction score
240
Let's do another little experiment.

I start out with my previous setup I've been using for 1 million entries.

1) The "words" consists of:

Code:
mysql> desc test_words;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int(32)      | NO   | PRI | NULL    |       |
| word  | varchar(255) | NO   | UNI | NULL    |       |
+-------+--------------+------+-----+---------+-------+

mysql> select count(1) from test_words;
+----------+
| count(1) |
+----------+
|     9993 |
+----------+

mysql> select avg(length(word)) from test_words;
+-------------------+
| avg(length(word)) |
+-------------------+
|            6.3832 |
+-------------------+
1 row in set (0.01 sec)

2) Now I create two additional tables, one being fixed-width, the other varchar-based:

Code:
create table test_words_char (
  id integer(32) primary key auto_increment,
  word char(9) not null,
  constraint unique test_words_char_u1(word)
);

create index test_words_char_word on test_words_char(word);

create table test_words_varchar (
  id integer(32) primary key auto_increment,
  word varchar(9) not null,
  constraint unique test_words_varchar_u1(word)
);

create index test_words_varchar_word on test_words_varchar(word);

3) On disk, we've got:

Code:
12      test_words_char.frm
0       test_words_char.MYD
4       test_words_char.MYI

12      test_words.frm
196     test_words.MYD
324     test_words.MYI

12      test_words_varchar.frm
0       test_words_varchar.MYD
4       test_words_varchar.MYI


4) Let's add some data:
Note that I've had to filter out some rows because of duplicate values:

Code:
mysql> insert into test_words_char (word) select distinct left(word, 9) from test_words;
Query OK, 9777 rows affected (0.14 sec)
Records: 9777  Duplicates: 0  Warnings: 0

mysql> insert into test_words_varchar (word) select distinct left(word, 9) from test_words;
Query OK, 9777 rows affected (0.13 sec)
Records: 9777  Duplicates: 0  Warnings: 0

Thankfully the length is the same:

Code:
mysql> select avg(length(word)) from test_words_char;
+-------------------+
| avg(length(word)) |
+-------------------+
|            6.1663 |
+-------------------+
1 row in set (0.00 sec)

mysql> select avg(length(word)) from test_words_varchar;
+-------------------+
| avg(length(word)) |
+-------------------+
|            6.1663 |
+-------------------+
1 row in set (0.01 sec)

And so are the words inside:
Code:
mysql> select count(1) from test_words_char c where not exists (select 1 from test_words_varchar v where v.word = c.word);
+----------+
| count(1) |
+----------+
|        0 |
+----------+
1 row in set (0.09 sec)

5) How much disk space do the various tables use?

Code:
12      test_words_char.frm
320     test_words_char.MYD
256     test_words_char.MYI
[B]588     tota[/B]l

12      test_words_varchar.frm
192     test_words_varchar.MYD
256     test_words_varchar.MYI
[B]460     tota[/B]l

12      test_words.frm
196     test_words.MYD
324     test_words.MYI
[B]532     total[/B]

It shouldn't come as a surprise that the varchar(9) table uses the least space. I hope you're not too surprised that the constant-length table wastes quite a lot of bytes because the most common words are shorter than 9 characters. Nevertheless, it's funny to note that the original table with all and untruncated rows occupies less that the constant-length table.

What does this tell you? You should only use char(...) when your data is fixed length, like 14-digit coded dates: 20090120123000.

But I'm sure you'll tell us why the fixed-length strings will perfectly make sense in your case. After all, aren't you aiming at saving maybe 50-100KB on the words table when the additional join indexes you need already take up more than 10MB?
 

JerseyFoo

1/g = g-1
Reaction score
40
Static tables are supposedly stabler and faster, perhaps not for string searches. Either way I'd like to try it.

Nevertheless, it's funny to note that the original table with all and untruncated rows occupies less that the constant-length table.
Not if your talking index.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
Static tables are supposedly stabler and faster, perhaps not for string searches. Either way I'd like to try it.
I knew you'd find something. :rolleyes:

Not if your talking index.
I'm referring to the whole space required of course, but mostly to the tables, as they're of different structure. The indexes are exactly the same for both tables and thus require the same space per row, whether your columns are char or varchar.
 

JerseyFoo

1/g = g-1
Reaction score
40
I need an algorithm to most fastly and effectively figure out uniqueness of English titles in URI format

"This Is Most Common Title" = "this-is-most-common-title" ~ 1
"This Is 2nd Most Common Title" = "this-is-2nd-most-common-title" ~ 2

...would be maximum efficiency, however it needs to be done nearly instantly upon request, so has to be rough.

Example: By first char of each word

"this-is-most-common-title" ~ timct
"this-is-2nd-most-common-title" ~ ti2mct

Can start there :D
 

Samuraid

Advisor
Reaction score
81
If you have a base string and you need to figure out the distance to other strings, you might try something like this:
http://php.net/levenshtein

Otherwise, the "best" algorithm you use completely depends on the data patterns you expect in the specific URLs on your site.

If you are already calculating the formatted URL form (i.e. "this-is-most-common-title"), you might as well just compare the entire formatted URL string with others as it takes roughly linear time, and your algorithm to get the first characters of each word also takes roughly linear time, but doesn't run in place (it creates extra strings) and requires another comparison on top of it to compare the smaller strings it generates.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
If you have a base string and you need to figure out the distance to other strings, you might try something like this:
http://php.net/levenshtein
Unfortunately, the only string similarity operator available in MySQL is soundex(), which should be a must in this case ([THREAD=116981]see here[/THREAD] for the first part of the story). It'd be easy with Oracle as backend, as it provides (double) metaphone as standard function.

You may implement the Levenshtein distance as stored procedure, but you'll still be dependent on other criteria (range scans) for good performance. Even with MySQL, soundex could at least be used as indexed column.
 

JerseyFoo

1/g = g-1
Reaction score
40
SoundEx has actually come a 2nd in efficiency ( by length, anyway ) and a dead last in accuracy.
Avg length: 4
% Conflict (adj): 0.832%
% Conflict: 0.684%
Time: 6751us
(adj) weights unique conflicts a bit heavier)

Metaphone
Avg length: 9.22
% Conflict (adj): 0.501%
% Conflict: 0.351%
Time: 5571us

"First Character Only" does surprisingly well...
Avg length: 3.92
% Conflict (adj): 0.553%
% Conflict: 0.39%
Time: 8744us

The most accurate so far has been "No Vowel"
Avg length: 12.24
% Conflict (adj): 0.166%
% Conflict: 0.101%
Time: 12841us

It should be noted that with most conflicts the difference is a version number (ought to try it with normal english data as well). Considering adding a "number" parameter.

I doubt the thread's host is this "bored" though.
 

enouwee

Non ex transverso sed deorsum
Reaction score
240
SoundEx has actually come a 2nd in efficiency and a dead last in accuracy.
Think before using a function. Judging by the average length of "4", you applied the soundex() transformation to single words only (it's fixed-width per word). This is the wrong way to use phonetic algorithms.
 

JerseyFoo

1/g = g-1
Reaction score
40
I've been using it on titles. Tried separated by '-', ' ', and no separation. If it needs to be done per word that will just skyrocket the inefficiency.
 
General chit-chat
Help Users
  • No one is chatting at the moment.
  • Varine Varine:
    I ordered like five blocks for 15 dollars. They're just little aluminum blocks with holes drilled into them
  • Varine Varine:
    They are pretty much disposable. I have shitty nozzles though, and I don't think these were designed for how hot I've run them
  • Varine Varine:
    I tried to extract it but the thing is pretty stuck. Idk what else I can use this for
  • Varine Varine:
    I'll throw it into my scrap stuff box, I'm sure can be used for something
  • Varine Varine:
    I have spare parts for like, everything BUT that block lol. Oh well, I'll print this shit next week I guess. Hopefully it fits
  • Varine Varine:
    I see that, despite your insistence to the contrary, we are becoming a recipe website
  • Varine Varine:
    Which is unique I guess.
  • The Helper The Helper:
    Actually I was just playing with having some kind of mention of the food forum and recipes on the main page to test and see if it would engage some of those people to post something. It is just weird to get so much traffic and no engagement
  • The Helper The Helper:
    So what it really is me trying to implement some kind of better site navigation not change the whole theme of the site
  • Varine Varine:
    How can you tell the difference between real traffic and indexing or AI generation bots?
  • The Helper The Helper:
    The bots will show up as users online in the forum software but they do not show up in my stats tracking. I am sure there are bots in the stats but the way alot of the bots treat the site do not show up on the stats
  • Varine Varine:
    I want to build a filtration system for my 3d printer, and that shit is so much more complicated than I thought it would be
  • Varine Varine:
    Apparently ABS emits styrene particulates which can be like .2 micrometers, which idk if the VOC detectors I have can even catch that
  • Varine Varine:
    Anyway I need to get some of those sensors and two air pressure sensors installed before an after the filters, which I need to figure out how to calculate the necessary pressure for and I have yet to find anything that tells me how to actually do that, just the cfm ratings
  • Varine Varine:
    And then I have to set up an arduino board to read those sensors, which I also don't know very much about but I have a whole bunch of crash course things for that
  • Varine Varine:
    These sensors are also a lot more than I thought they would be. Like 5 to 10 each, idk why but I assumed they would be like 2 dollars
  • Varine Varine:
    Another issue I'm learning is that a lot of the air quality sensors don't work at very high ambient temperatures. I'm planning on heating this enclosure to like 60C or so, and that's the upper limit of their functionality
  • Varine Varine:
    Although I don't know if I need to actually actively heat it or just let the plate and hotend bring the ambient temp to whatever it will, but even then I need to figure out an exfiltration for hot air. I think I kind of know what to do but it's still fucking confusing
  • The Helper The Helper:
    Maybe you could find some of that information from AC tech - like how they detect freon and such
  • Varine Varine:
    That's mostly what I've been looking at
  • Varine Varine:
    I don't think I'm dealing with quite the same pressures though, at the very least its a significantly smaller system. For the time being I'm just going to put together a quick scrubby box though and hope it works good enough to not make my house toxic
  • Varine Varine:
    I mean I don't use this enough to pose any significant danger I don't think, but I would still rather not be throwing styrene all over the air

      The Helper Discord

      Members online

      No members online now.

      Affiliates

      Hive Workshop NUON Dome World Editor Tutorials

      Network Sponsors

      Apex Steel Pipe - Buys and sells Steel Pipe.
      Top