Oh boy, do I have a doozy of a rod for my back.
I’m working on a node app that scrapes a backup of a phpbb forum website, then rebuilds that site in a clean and static form. I’ve got the scraping going just fine, and it’s spitting into a database. Now comes the time to rebuild.
I was using SQLite, which worked but was (unsurprisingly) slow as heck, what with the ~300,000 posts this site had. So I’m migrating it to MySQL. And here’s where I’ve run into something of a roadblock.
My tables look like this:
name | datatype | description |
---|---|---|
id | integer | primary key |
name | string | Forum name |
description | string | Basic description |
parentId | integer | ID of parent forum or NULL |
name | datatype | description |
---|---|---|
id | integer | primary key |
name | string | Topic name |
dateCreated | datetime | When the topic was created |
userId | integer | ID of the user that created this topic |
forumId | integer | ID of parent forum |
name | datatype | description |
---|---|---|
id | integer | primary key |
subject | string | Post subject |
datePosted | datetime | When the post was posted |
body | text | body of the post |
userId | integer | ID of the user that created this post |
topicId | integer | ID of parent topic |
name | datatype | description |
---|---|---|
id | integer | primary key |
name | string | User’s displayed name |
avatar | string | URL of user’s avatar image |
Seems simple, right? So now comes to the crunch. When building the forum index, I need some values. For instance:
And to make this, I used this hideous SQL query:
WITH
TopicCounts as (
SELECT COUNT(TPC.id) AS TopicCountTopics, SUM(TPC.PostCount) AS PostCount, TPC.ForumId
FROM (
SELECT Topics.id, Topics.ForumId, COUNT(Posts.id) AS PostCount
FROM Topics
LEFT JOIN Posts ON Posts.TopicId = Topics.id
GROUP BY Topics.id
ORDER BY Topics.ForumId
) AS TPC
GROUP BY TPC.ForumId
),
LatestTopics AS (
SELECT Topics.name AS LatestPostName,
Topics.id AS LatestTopicId,
Users.name AS LatestUserName,
Posts.datePosted AS LatestPostDate,
Topics.ForumId
FROM Topics
INNER JOIN Posts ON Posts.TopicId = Topics.id
AND Posts.id IN (
SELECT MAX(PP.id)
FROM Posts AS PP
GROUP BY PP.TopicId
)
INNER JOIN Users ON Users.id = Posts.UserId
ORDER BY Posts.datePosted DESC
)
SELECT Forums.id,
Forums.name,
Forums.description,
Forums.parentId,
TopicCounts.TopicCountTopics AS TopicCount,
TopicCounts.PostCount AS PostCount,
LatestTopics.LatestPostName,
LatestTopics.LatestPostDate,
LatestTopics.LatestUserName,
MAX(LatestTopics.LatestTopicId)
FROM Forums
LEFT JOIN TopicCounts ON TopicCounts.ForumId = Forums.id
LEFT JOIN LatestTopics ON LatestTopics.ForumId = Forums.id
GROUP BY Forums.id
ORDER BY Forums.parentId, Forums.id
Which worked fantastically in SQLite, especially with the orderings and returning the top result. MySQL, however, is not behaving, complaining that expression of select list is not in group by clause and contains nonaggregated column
. And “fixing” that by telling it that’s okay isn’t helping, because the aggregated results are being returned in arbitrary order.
For consistency’s sake, I can say that IDs are sequential, so MAX(Posts.id)
will always return the latest post. I also understand this is a non-trivial thing for SQL, but I’m hoping that with modern systems like MySQL 8 here, it can be made to work.
Execution time is not particularly important, as this is being run to create a static site, so it’s a one-time query, but doing all the aggregation by hand, especially with a table involving 300 thousand rows, is going to be a little daunting.
Any coders care to help? Feel free to @ me on Mastodon or find me in the LRR discord.
With some assistance from SnyperWolf, I have the following:
WITH
TopicCounts as (
SELECT COUNT(TPC.id) AS TopicCountTopics, SUM(TPC.PostCount) AS PostCount, TPC.ForumId
FROM (
SELECT Topics.id, Topics.ForumId, COUNT(Posts.id) AS PostCount
FROM Topics
LEFT JOIN Posts ON Posts.TopicId = Topics.id
GROUP BY Topics.id
ORDER BY Topics.ForumId
) AS TPC
GROUP BY TPC.ForumId
),
LatestTopics AS (
SELECT TM.name AS LatestPostName,
TM.id AS LatestTopicId,
Users.name AS LatestUserName,
TM.datePosted AS LatestPostDate,
TM.ForumId
FROM
(
SELECT TT.name, TT.id, TT.ForumId, PM.datePosted, PM.UserId, ROW_NUMBER() OVER (PARTITION BY TT.ForumId ORDER BY PM.datePosted DESC) AS rn
FROM Topics AS TT
INNER JOIN Posts PM ON PM.TopicId = TT.id
AND PM.id IN (
SELECT MAX(PP.id)
FROM Posts AS PP
GROUP BY PP.TopicId
)
) AS TM
INNER JOIN Users ON Users.id = TM.UserId
WHERE rn=1
ORDER BY TM.datePosted DESC
)
SELECT Forums.id,
Forums.name,
Forums.description,
Forums.parentId,
TopicCounts.TopicCountTopics AS TopicCount,
TopicCounts.PostCount AS PostCount,
LatestTopics.LatestPostName,
LatestTopics.LatestPostDate,
LatestTopics.LatestUserName,
LatestTopics.LatestTopicId
FROM Forums
LEFT JOIN TopicCounts ON TopicCounts.ForumId = Forums.id
LEFT JOIN LatestTopics ON LatestTopics.ForumId = Forums.id
ORDER BY Forums.parentId, Forums.id
Which seems to work perfectly with MySQL. And I’m not going to think about getting that same query to work with SQLite.
Posted 2025-08-14 by Robert "Anaerin" Johnston