Membership is FREE, giving all registered users unlimited access to every Acorn Domains feature, resource, and tool! Optional membership upgrades unlock exclusive benefits like profile signatures with links, banner placements, appearances in the weekly newsletter, and much more - customized to your membership level!

Any MySQL wizards here?

Joined
Oct 15, 2012
Posts
1,325
Reaction score
191
I've got a big ol' MySQL database here, one table with over 1,000,000 rows in a single column, each row contains a single sentence. Is there a query I can run to bring back a list of the most common words in the entire table?
 
Not that I can think of. You'd be better off pulling results into php and dealing with it there.

Or depending on the reasoning for wanting to do it, download a csv backup and copy pasting it all into a word cloud site?
 
Hmmm OK... we're talking manyt millions of individual words here so I think PHP might buckle!
 
It's been a while since i used MySQL but would something like this work?

SELECT ColumnName, COUNT(ColumnName) AS ColumnName_Count
FROM table_name
GROUP BY ColumnName
ORDER BY ColumnName_Count DESC;
 
Loop through the rows - explode the sentence into words - increment an associative array of the words. At the end you'll have an array of all the words and their count.
 
It's been a while since i used MySQL but would something like this work?

SELECT ColumnName, COUNT(ColumnName) AS ColumnName_Count
FROM table_name
GROUP BY ColumnName
ORDER BY ColumnName_Count DESC;

I think that will only count exact record match? I don't think it'll help counting words in sentences.

Loop through the rows - explode the sentence into words - increment an associative array of the words. At the end you'll have an array of all the words and their count.

Thanks Rob. If anyone can knock that up for me for a reasonable fee please let me know!
 
Something like below: note you have to connect to your database and also this example only separates words by a space character.

<?
$query="SELECT column FROM table;";
$result=mysql_query($query);
while ($row=mysql_fetch_assoc($result))
{
$words=explode(" ",strtolower($row['column']));
for ($i=0;$i<count($words);$i++)
{
$array[$words[$i]]++;
}
}
print_r($array);
?>
 
Lovely stuff Rob, sure I can get the database connection stuff done :)
 
If you don't care how the output is formatted I'd do a string replace to delete everything in the row that's not alphanumeric - that takes care of punctuation
 
Getting there I think, but it's throwing up errors for every word like:

Notice: Undefined index: hotel in C:\wamp64\www\mysqchk\check2.php on line 26

Line 26 is:

$array[$words[$i]]++;

Anything obvious that I'm missing?
 
Change that line to:
if (isset($array[$words[$i]])) $array[$words[$i]]++; else $array[$words[$i]]=1;

I think it's only a warning anyway - the code would still function but this is safe
 
Last edited:
Also if you put arsort($array); before your print it will sort the array into descending order of amounts - ie most common word first
 
So close I can smell it, I'm getting a memory error:

Allowed memory size of 134217728 bytes exhausted

I'm guessing because my database is so bloody big, there must be something like 5million+ words in there. I'll raise the PHP memory limit and see what happens.
 
Yeah increase the limit in your php.ini file or probably better to use a loop in your SELECT and a LIMIT to get a smaller amount - eg groups of 50k if the memory error is coming from selecting all records. However this will make no difference if the memory error is from the array.
 
It's feasible to do this within the database itself rather than pulling a million rows into PHP - and dealing with the memory usage that entails!

I found this which almost does what you want: https://dba.stackexchange.com/a/82456 - I modified it slightly to add counts and order by most frequent words, leaving us with this query:

Code:
SELECT
  COUNT(*) as num,
  SUBSTRING_INDEX(SUBSTRING_INDEX(logs.useragent, ' ', numbers.n), ' ', -1) logs
FROM
  (SELECT 1 n UNION ALL SELECT 2
  UNION ALL SELECT 3 UNION ALL SELECT 4) numbers INNER JOIN logs
  ON CHAR_LENGTH(logs.useragent)-CHAR_LENGTH(REPLACE(logs.useragent, ' ', '')) >= numbers.n-1
GROUP BY logs
ORDER BY num DESC
LIMIT 20;

I tested it on a table I have with ~65,000 rows, the table name is 'logs' and the column I was searching is 'useragent' - replace these with the appropriate ones for your table. If you need to replace punctuation I'm sure that can be done too. The output for me looked something like this:

Code:
+-------+-----------------------------------------+
| num   | logs                                    |
+-------+-----------------------------------------+
| 38314 | Mozilla/5.0                             |
| 17251 | (compatible;                            |
| 12233 | NT                                      |
| 12211 | (Windows                                |
|  7452 | MSIE                                    |
|  6167 | 6.1;                                    |
|  5831 | U;                                      |
|  5718 | Mozilla/4.0                             |
|  5379 | Linux                                   |
|  5367 | (X11;                                   |
|  4365 | Windows                                 |
|  4355 | (Windows;                               |
|  4014 | Googlebot/2.1;                          |
|  4007 | +http://www.google.com/bot.html)        |
|  3822 | x86_64)                                 |
|  3581 | Stratagems                              |
|  3581 | Kumo                                    |
|  3205 |                                         |
|  2821 | http://www.nominet.org.uk/privacypolicy |
|  2803 | (Macintosh;                             |
+-------+-----------------------------------------+

That took just over 2 seconds to run on the ~65,000 rows which take up about 10MB of space.

It'll be interesting to see how this scales to a 1M row table!
 
I fear I may have to admit defeat here... PHP still laughed at me when I raised to 1GB memory, and phpmyadmin/MySQL crapped out after 10 minutes running the query above (it just returned to a blank query input screen).

I think the only way I can make this happen is to:

1. Have a seperate counting table (one column for words, one for the count) to input new words and +1 the count for words already there
2. Have a new column in my data table where I can flag that row has already been parsed before. This way I can keep running the script, and it'll skip rows that have already been counted and entered into the counting table

Makes sense in my head, but it's beyond my PHP skills.
 
Did you put the select in a loop? If that is the cause of your memory problem it will fix it. But if the array is using up your memory you can always have another table with each word in it and a count for it - no array memory. If you want to PM me any relevant info I'll sort it for you.
 
Thank you Rob that would be very kind, I'll ping you a PM tomorrow but I don't want to take the piss so tell me when I'm getting close ;)

I haven't had a play today, I've been in a deep funk with the death of Chris Cornell, a true hero of mine from my youth :(
 
Yeah I've been listening to Audioslave all day - never really liked soundgarden much but love RATM.
 

The Rule #1

Do not insult any other member. Be polite and do business. Thank you!

Premium Members

New Threads

Domain Forum Friends

Our Mods' Businesses

*the exceptional businesses of our esteemed moderators
General chit-chat
Help Users
  • No one is chatting at the moment.
      There are no messages in the current room.
      Top Bottom