Defect: Large amounts of duplicate media (possibly only ~30% original) #1398

Open
opened 2026-02-17 01:21:52 +00:00 by gamesguru · 7 comments
Contributor

I noticed this after running jdupes on a backup of my database's media/ folder.

By a rough estimate, about 5000 of the 7500 files are duplicates. Often there are 5-10 copies of the same file.

$ jdupes -r . > output.log
Scanning: 7530 files, 1 items (in 2 specified)

$ wc -l output.log 
5097 output.log

$ ls | wc -l
7532
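
The `wc -l` figure above is only an upper bound on the number of redundant files. Assuming jdupes' default output (one path per line, with a blank line between duplicate sets), it also counts the separator lines and the one file per set that would be kept. A minimal Python sketch for getting a tighter count from the same `output.log` follows; the function name is illustrative and not part of jdupes.

```python
from pathlib import Path


def summarize_jdupes_log(text: str) -> tuple[int, int]:
    """Return (duplicate_sets, redundant_files) for jdupes-style output.

    Assumes one path per line and a blank line between duplicate sets.
    """
    sets = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sets.append(current)
            current = []
    if current:
        sets.append(current)
    # One file per set is the copy that would be kept; the rest are redundant.
    redundant = sum(len(s) - 1 for s in sets)
    return len(sets), redundant


if __name__ == "__main__":
    log = Path("output.log")
    if log.exists():
        n_sets, n_redundant = summarize_jdupes_log(log.read_text())
        print(f"{n_sets} duplicate sets, {n_redundant} redundant files")
```

Run from the directory that holds `output.log`, this prints how many sets were found and how many files could be removed while keeping one copy per set.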

Of interest, I had 77 copies of the Continuwuity dashboard.

1GWpVuvemrnh7QlOGpvsgO-faCCE-Fzx48u9Mw3VX0E
1dBtmFWRMo7BpJvJn5uRslutSvz-u9El2GMnK6nPaQk
3EL0axAUrLG39s3dAfXGeAWObunA1z6O7s-DkIJI8jw
3Xc1tT9Fc8bz7PUUGE_AP9eCbPV1Y7DTe8JwinVFZ48
5ItTbTjLg8_vq5xG9jg4FKKK8hXq6Y2TfjUmxZp_AvA
7gMgDqkpCGPnqNLSHTwP4ZMqozPMkEpe1j_MYowEw3o
7-V5XQ2hwGHOQYofCG12paLpgGjslapN-VmIL_by7lY
9iNINOZcyxsOpFIBXAEt9RweJNA-6GvLGR4M2mRG14E
9tJOFbvDfCHA6flzVJ_-viCRYx8P8C5Ry7SeeMQYHxo
9xYnbDjnbgZJaq5kHtq-mkuJuJ9Wt_ZB9X8_FmakUj0
69Zkbni2cCudJytboExSuub9vRtN1W6JKxG5MF3D_Ro
Ag_kxd1qZLrjuoZdmpY5KrFDTgMwDrCBztxSOGUAhOQ
B71CUgYK8G7JkyHmI3mZhzKzlh4VNT2r3kO8REWCw3s
BTG46mhFvEwUgwiY3xakQMLytYq3eK7ZGrPSNb0F_Yg
BrT7VTAVAc7bxG0kebyryucuOzZZfAb5qXkB_GgEfow
D2VBdAF0ob-Ze6iV2-cbdhxPuEa6pE3cHYAXd9kRQg0
DV3FlDTlqC7rcRmjoTz4eV5RPFESwRO0z6yWMWGOsAw
FHdUM5I-9IFSqic6lfThgE1b0qT8s-Hpe8Ls_wHCDbM
Gl7dAkmddUnq-tXF5r3rd9nAw_a4zu7ad_eyNx9jgmc
H92CPO2nXVWDiB_5oMXbD0SmZ3zX1m5lgJIJa9yQok0
HrzDlGcTqv9poCOJzdG2TXonWURYJXOBQ3EhXsAp1cs
Ktoy4JI8TXTzHrBqfsQ75061HRoWQTf0Wf6lkNI-HSc
M2A7raPwVNxfMIb0-lpad-j4P0FgNSY285A11G7GoRA
M8iqYeCnMSL6cLXteCkaizuYLiK90XVwSLBNCUFszIs
MhBOI2u1uHOa3qqNYNwLKLWaz6kVi7toRRFvoXuy6Jw
NR9yT44anp_dQuyAbBeFIZUeqDmPibRz0OTjAY8oXqw
O0N8E8WriUSgg23nhTv4KJcF0WAqICD7QpWKdoCEDnQ
Oxc9dhZgIMIkFrCwlrwpiKRFUvfT8I8169eGBRMbRJc
PMGiZlknjYrj4bDp0sNx7TbRkR6d0pG9JBhdEZ-axLg
PMZPwLoQMKC-ESvB3yfpE9aHv7n-_oTtHGFuOSPiQuY
Ptf8kecgLfIALEKviOsAa-mWWfDVcmQCnVulKSdrg7M
QIGz-7nR8xs8Nk0MBK820pCWEdhw44IOaLN2bRwuUhg
Qgi8jJz0a5AK8-4b6Bvn9sDoYBEyK05WOSH9CrC9A5Y
QtvbAdIHMXixOFrOG3gApWGHaupKjve-8LwcrF4X0no
R7BhEuBM0zrkIWH8SmWc3IQPPakFwbvoOahQsj1GSPo
S2YOExWCILgZdBTd3hg_xHa0zRp0kD4endwR0PZb5vI
U5n26Tj-fnjTzPn4zxRtvUan8gDxQsl2l5ly2UOGqdc
UJBBODm5PygB3EvwApYnvSu38cM5QXjZ3lrWMxaVQaU
YLxSskT7llVGs3OZdZyvfNN20clIqfsZj7ZhmgC2eLM
_8dgrvUYFIZD6JhaRG-jr6AI_DB1RnBucJRhR70DA-k
_ICVyin2sLXtTZM78H75agzNONp6rlINaz3755h5oso
__hK-z2Pq1Zag-wktmdIrrl1SZDetIoWGEMZYVnpass
bSypu2JwQOt6bepyem4kh-FoYB_WUH8dJncix21FFIw
cQGuVDoqCYqfUyeWxh0FkI3a5K8ObHKsiIZclmVNOgA
dDJK_gw7RWiwPPPHRamlew0iroqUcITYRg7L695ZRAw
dNo1lpZ9ddQD7C_A_GjHt5blev7tMcqJbHlyrhH7XcU
eRPDQedvhYsoG52vSzUjLopuP_574d_ioHZRBn-Lc6U
fAIUEhD-K52hzrJxlUj-Nxgf0w94YICvZf9b5w-zJDs
fYEFatdcEkjH6Qu72cNmZqBzjlERHbkZGA46NBhxJNY
jG50sOxZSeP_w9l3x6SKMPstxf0BBAmdm5asAVdaSDM
lg1LJ3c5-0hbnsHNSeQiCFFTUsA8aAmAjDRRWwKc_fY
lhOJzU4LFTCSp572PjGsJmkjejUZu_gbzyrRlk7C-hI
lhRxuWikPiJti-W8KxsZyQcwftcFFjK7T56z9tH8B1I
m0xfwmoGNJJdrYLZvdOJSuURdz2NQIEO71MuPuBsAY4
mI8YgQaC9VEPba5k_Moy953S1K8AUDRcussMeaFE-1M
p7v6kCLyyvRacj0OraXjk1EZF7Qmrvbzus7YlhpgtJ4
p_8hnGx7F1rxCk5uoP9t4vJt-KGXIL2iNI-2ibPwVkY
pikvz8xapahyr8q1chR3rMiTu3Vv2y8JUNo9tkgMlBg
q5nEmdZcL4uNTJByHaNOb6M1Xe2hsygP6jbMt52vCO8
qUqM0yEI_l-WDgqxCjGpAlBWkBkhJvxBvGukkeAi75w
qVYaUpX2E2kqUgfM0CtbuLFfIPVuTpSlcADkifawEVo
r05LY5T22nzCox9f0meRSRCCR6uctfP11baTjgqa-B0
tHODC9vZaxKGAm_fIXKC-l5n3v44nxtX_h7J0Px9Sbc
thZ4Qq9SUtYauybRRK4v7lF9pPLBBoX4btbMbN-uKsg
uW62fWwhw454dEAfcxZfgNSls41-ITYjPDibzKEsV14
uiKsNDzAplE35f1Ggk_58Du3QsQDsTBuWRMdkp6g9JM
wTfscu6LkkE68zY_PXUUcq1C2eTDCmkOwJ4AyiVBOEs
x3FoHgt7VZJTXRcYhQsuiH_X-p_qgnbCWo7tX2HzNI8
x91iUD6voblCfLhuni-da38COT7xA-tQTFF6WxACD3w
xRsp2Ye-m-TmsnwYOeWvd9IxFIcXaL3un8qHCJyopws
yorhh_ONVblFtESlVicVth0icLtavRQNJDzsDMdrV6M
yotokdgjDALIlxyw-U2HrrkUXZRsLrmFZ7Adq13Bfsk
ytPRyXcku5uxZTg82U67A-dLUqRC-ABtYj32JuKfRro
zdxtjy2Vr1jLpLhiR03csMYb6xrMhhkw3VRN6tIsggc
-Ci3746t9n6VdbxARnzGuJUzc_o54gl4JVGzZT89w0Y
-EJhyq2ls8iRnzJki41m66rRr-tf94LDTM4WT2LgNAE
-dwFaS-ljcKFiYIrfgcloAR9ec9yCFsZkKx_icWQOlg
Owner

Seems like URL previews based on that?

Author
Contributor

@Jade wrote in #1398 (comment):

Seems like URL previews based on that?

Hmm. Is that where the c10y banner appears?

I think it's affecting everything, lol.

If I sort by size I see many duplicate profile pictures.

image

If I go to the smaller files, there are many duplicate thumbnails.

image

Owner

Might be something wider then.

Owner

For what it's worth, I looked into this before disappearing last night and figured out how we calculate the file names: each one is a SHA-256 hash of `mxc + dimensions + content_disposition + content_type`. If you're getting duplicate media with different file names, it's because one of those values has changed, which means it's actually just a new media file. As far as I can tell, this is working as intended: MXCs don't necessarily map 1:1 to a file on disk, and there might be many files for one MXC depending on the requested parameters.
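
As a rough illustration of why one MXC can end up as several files, here is a minimal Python sketch of a name derived from those four values. This is not Continuwuity's actual code: the function name, the field order, the separator byte, and the unpadded URL-safe base64 encoding are all assumptions (the encoding is only a guess based on the 43-character names listed earlier in the thread).

```python
import base64
import hashlib


def media_file_name(mxc: str, width: int, height: int,
                    content_disposition: str, content_type: str) -> str:
    """Hypothetical key derivation: hash all request parameters together."""
    h = hashlib.sha256()
    for part in (mxc, f"{width}x{height}", content_disposition, content_type):
        h.update(part.encode("utf-8"))
        h.update(b"\xff")  # assumed separator so fields cannot run together
    # 32 digest bytes encode to 43 base64 characters once padding is dropped.
    return base64.urlsafe_b64encode(h.digest()).rstrip(b"=").decode("ascii")


# Same MXC, three different parameter sets: three different file names.
original = media_file_name("mxc://example.org/abc", 0, 0, "inline", "image/png")
thumb = media_file_name("mxc://example.org/abc", 96, 96, "inline", "image/png")
renamed = media_file_name("mxc://example.org/abc", 0, 0,
                          'inline; filename="a.png"', "image/png")
```

Under this scheme, `original` and `renamed` would hold byte-identical content under different names, which is the kind of pair a tool like `jdupes` reports.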

Author
Contributor

I do vaguely recall noticing a similar issue when I ran Synapse back in 2021. So it may be a difficulty faced by Matrix servers in general, not just Continuwuity.

Probably not worth the development effort today because media is generally much smaller than the database anyway.

I wonder if there is any harm in using jdupes to replace duplicates with links to one agreed original? This would be a relatively easy workaround to free up the few hundred MB or more used by duplicates.
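
For reference, jdupes has a hard-link mode (`-L`; check the man page for your version). The same idea can be sketched in Python, shown below with a dry-run default. This is an untested-in-production sketch, not a recommended procedure: the function names are illustrative, and it assumes a flat `media/` directory on a single filesystem and that the server never rewrites media files in place (a rewrite would change every linked name at once).

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(media_dir: Path, dry_run: bool = True) -> int:
    """Replace byte-identical files with hard links.

    Returns how many files were (or, in a dry run, would be) replaced.
    """
    by_digest = defaultdict(list)
    for path in sorted(media_dir.iterdir()):
        if path.is_file() and not path.is_symlink():
            by_digest[file_digest(path)].append(path)

    replaced = 0
    for paths in by_digest.values():
        keep, *dupes = paths
        for dupe in dupes:
            if os.path.samefile(keep, dupe):
                continue  # already a link to the same inode
            if not dry_run:
                tmp = dupe.with_name(dupe.name + ".linktmp")
                os.link(keep, tmp)     # create the new link first
                os.replace(tmp, dupe)  # then atomically swap it in
            replaced += 1
    return replaced
```

Calling `hardlink_duplicates(Path("media"))` on a copy of the folder only reports a count; passing `dry_run=False` performs the replacement. File names are preserved, so lookups by name keep working, but trying it on a backup first seems prudent.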

Author
Contributor

What's also interesting is that, at least on my ext4 system, some of the "same" images appear to have unequal sizes (byte differences).

You can see in my screenshots that, when I sort by size, a run of duplicates is sometimes interrupted by an unrelated image whose size falls between those of the run's members (which, I assume, ought to match byte for byte).

Owner

Likely because the images have different sizes / dimensions.

Reference
continuwuation/continuwuity#1398