Defect: Large amounts of duplicate media (possibly only ~30% original) #1398
3 participants
Reference
continuwuation/continuwuity#1398
I noticed this after running jdupes on a backup of my database's media/ folder.
By a rough estimate, 5000 out of 7500 files are duplicates. Often this involves 5-10 copies of the same file.
Of interest, I had 77 copies of the Continuwuity dashboard.
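For anyone who wants to reproduce that rough estimate without jdupes, here is a minimal sketch that groups files by the SHA-256 of their contents. The function names and the directory argument are made up for illustration, and it reads each file fully into memory, which is fine for typical media files but not tuned for very large ones.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(media_dir):
    """Group regular files under media_dir by the SHA-256 of their contents."""
    groups = defaultdict(list)
    for path in sorted(Path(media_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only the content hashes shared by more than one file.
    return {d: paths for d, paths in groups.items() if len(paths) > 1}


def count_redundant(duplicates):
    """Number of files left over if one copy per group is kept."""
    return sum(len(paths) - 1 for paths in duplicates.values())
```

Calling count_redundant(find_duplicates("media/")) on a backup copy gives a number comparable to the estimate above; run it on a backup rather than the live folder.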
Seems like URL previews based on that?
@Jade wrote in #1398 (comment):
Hmm. Is that where the c10y banner appears?
I think it's affecting everything, lol.
If I sort by size I see many duplicate profile pictures.
If I go to the smaller files, there are many duplicate thumbnails.
Might be something wider, then.
For what it's worth, I looked into this before disappearing last night, and figured out how we calculate the file names: it's a sha256 sum of mxc + dimensions + content_disposition + content_type. If you're getting duplicate media with different file names, it's because one of those values has changed, which means it's actually just a new media file. As far as I can tell, this is working as intended: MXCs don't necessarily map 1:1 to a file on disk, and there might be many files for one MXC depending on the requested parameters.
I do vaguely recall noticing a similar issue when I ran Synapse back in 2021. So it may be a difficulty faced by Matrix servers in general, not just Continuwuity.
Probably not worth the development effort today because media is generally much smaller than the database anyway.
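To make the file-name explanation earlier in the thread concrete, here is a rough sketch of how a name derived from mxc + dimensions + content_disposition + content_type behaves. The function name, the field order, the separator byte, and the way dimensions are serialised are all assumptions for illustration; none of this is taken from the Continuwuity source.

```python
import hashlib


def media_file_name(mxc, width, height, content_disposition, content_type):
    """Hypothetical key: SHA-256 over the fields named in the thread.

    The byte layout (order, separator, encoding) is assumed, not copied
    from the server's code.
    """
    h = hashlib.sha256()
    for field in (mxc, f"{width}x{height}", content_disposition, content_type):
        h.update(field.encode("utf-8"))
        h.update(b"\xff")  # assumed separator so fields cannot run together
    return h.hexdigest()


# Same MXC, different requested dimensions: two distinct names on disk.
full = media_file_name("mxc://example.org/abc", 0, 0, "inline", "image/png")
thumb = media_file_name("mxc://example.org/abc", 96, 96, "inline", "image/png")
```

Under this scheme the name depends only on the request parameters, never on the file's bytes, so two entries with identical content but a different MXC or content_type still end up as separate files.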
I wonder if there is any harm in using jdupes to replace duplicates with links to one agreed original? This would be a relatively easy workaround to free up the few hundred MB or more used by duplicates.
What's also interesting is at least on my ext4 system, some of the "same" images appear to have unequal sizes (byte differences).
You can see in my screenshots that when I sort by size, a run of duplicates is sometimes interrupted by an unrelated image whose size falls in between, even though I would expect the members of a run to match byte for byte and so have identical sizes.
Likely because the images have different dimensions.
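On the hard-link workaround raised above, here is a minimal sketch of what replacing one duplicate with a link could look like. The function name and the temporary suffix are hypothetical, both paths must be on the same file system, and whether the server is happy with hard-linked media files is exactly the open question in this thread, so try it on a backup first.

```python
import filecmp
import os


def link_duplicate(original, duplicate):
    """Replace `duplicate` with a hard link to `original` if the bytes match.

    Returns True if a link was made, False if the files differ or already
    share an inode.
    """
    if os.path.samefile(original, duplicate):
        return False
    # Compare contents, not just size: look-alike images can differ in bytes.
    if not filecmp.cmp(original, duplicate, shallow=False):
        return False
    tmp = os.fspath(duplicate) + ".dedupe-tmp"  # hypothetical temporary name
    os.link(original, tmp)
    os.replace(tmp, duplicate)  # swap the link into place under the old name
    return True
```

The byte-for-byte comparison matters here because images that look the same but have different dimensions will not match, and those should be left alone.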