Author Topic: Working with large datasets?  (Read 9174 times)

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Working with large datasets?
« on: April 17, 2020, 08:53:07 AM »
What size datasets are people working with out there? My computer seems to have a disagreement with itself about the size of the cache and catalog files. The Finder says both are about 2.3 GB... but when I go to copy them to another drive, the Finder says they are 83.3 GB each! I am thinking this is because the folders are not indexed by the system. Still, this is a large file, and I am starting to think that the catalog for the entire archive (about four times the size of this test drive, which holds ~300k images) is going to require its own 1 TB SSD.
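Side note for anyone comparing numbers: here is a minimal sketch (assuming Python 3 is handy) that walks a folder and adds up what is actually on disk, which takes the Finder's lazy size reporting out of the equation. The path below is just a placeholder, not a location PM+ actually uses.

Code:
import os

def folder_size_bytes(path):
    """Walk a folder tree and sum the size of every file inside it."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # skip files that vanish or can't be read mid-walk
    return total

# Placeholder path: substitute wherever your catalog or cache folder actually lives.
target = "/Volumes/Scratch/PM-Catalog"
print(f"{target}: {folder_size_bytes(target) / 1e9:.1f} GB")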



Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #1 on: April 17, 2020, 09:43:26 AM »
What "cache" are you talking about?  A catalog folder contains the database files and a "proxies" folder.

-Kirk

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Re: Working with large datasets?
« Reply #2 on: April 20, 2020, 10:23:37 AM »
Hey, sorry Kirk. Should have given more details. The cache file for PM is the one I am referring to. We moved it off the main drive using the settings in PM+ because it was massive and filling up the main SSD in the computer.

So, that file is huge, the catalog is equally huge (understandable), and somehow it still keeps filling the SSD up to the max during the scanning process even though the catalog and cache folders are both on another HDD. When it does that, the process stops and it's nearly impossible to get it going again (again, a big data set). So, I am just trying to figure out who has tested this by pointing the catalog scan at large archive folders instead of starting from scratch with new material.

Does that make any more sense?

Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #3 on: April 20, 2020, 11:34:16 AM »
Quote from: shaneb on April 20, 2020, 10:23:37 AM
Hey, sorry Kirk. Should have given more details. The cache file for PM is the one I am referring to. We moved it off the main drive using the settings in PM+ because it was massive and filling up the main SSD in the computer.

What is/was the path to this "cache" file?  PM manages a cache folder which contains other folders and eventually files.  But you can control how big this is allowed to get and how much space to reserve.  It shouldn't get out of control with proper settings.

-Kirk

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Re: Working with large datasets?
« Reply #4 on: April 20, 2020, 06:21:36 PM »
Shouldn't... I agree. And usually doesn't.



I deleted the cache that exploded in size and the catalog. I'll repeat the scan here in a bit and see what happens this time and take better notes.

Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #5 on: April 20, 2020, 06:42:59 PM »
If you do find that it goes out of control, I'd like to know what particular subfolder is growing large (if you don't mind.)

Thanks,

-Kirk

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Re: Working with large datasets?
« Reply #6 on: April 20, 2020, 07:18:46 PM »
Sure thing... I am going to throw a 6 TB drive at it overnight (same one as last time). I have cleared all the caches, moved them off the boot drive, removed them from Spotlight, and we'll see what happens!

Thanks so much for building this thing. Even if it took a week or two to build, a fast, searchable catalog of the last 20 years of work would be huge for us... and everyone else. I don't know what you plan for this thing to cost, but if it works reliably with large datasets, the value is big for a lot of us.

Thinking about this as I go, our server is organized by year/date/assignment/filetypes. I am thinking in the end it would probably be best to build a catalog for each year instead of one for the entire server. Thoughts? Should it matter?

Shane

Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #7 on: April 20, 2020, 08:06:14 PM »
Shane,

Quote from: shaneb on April 20, 2020, 07:18:46 PM
Sure thing... I am going to throw a 6 TB drive at it overnight (same one as last time). I have cleared all the caches, moved them off the boot drive, removed them from Spotlight, and we'll see what happens!

OK, sounds good.

Quote from: shaneb on April 20, 2020, 07:18:46 PM
Thanks so much for building this thing. Even if it took a week or two to build, a fast, searchable catalog of the last 20 years of work would be huge for us... and everyone else. I don't know what you plan for this thing to cost, but if it works reliably with large datasets, the value is big for a lot of us.

We haven't completely solidified it but we think most people will be happy with the pricing.

Quote from: shaneb on April 20, 2020, 07:18:46 PM
Thinking about this as I go, our server is organized by year/date/assignment/filetypes. I am thinking in the end it would probably be best to build a catalog for each year instead of one for the entire server. Thoughts? Should it matter?

I think that's completely unnecessary.  How many image files in total?

-Kirk

Offline Armin M. Küstenbrück

  • Newcomer
  • *
  • Posts: 14
    • View Profile
Re: Working with large datasets?
« Reply #8 on: April 21, 2020, 05:00:59 AM »
Hi Kirk,
The same for me regarding big datasets: I have yearly folders with roughly 500,000 to 1 million photos from me and my colleagues, in RAW (some with XMP sidecars) and JPEG, in total 15-20 TB per year, stored on a 200 TB NAS.
The folder with the database including previews for one year is more than 300 GB, and the database file itself (catalog.pmdb) is 20 GB.
It was also my idea to generate a catalog for each year, to limit any damage inside the database in case of a hardware or software error. Then it would be easier to rebuild the database.
Why don't you recommend this?
Cheers,
Armin

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Re: Working with large datasets?
« Reply #9 on: April 21, 2020, 07:07:15 AM »
Overnight report... all programs were closed overnight except PM+. The cache, render, and other files from the cache library have remained reasonably sized at 150 MB. The catalog folder is up to 179 MB. When I first came out, the boot drive appeared to be 10 GB fuller than last night... but 20 minutes later that space had been freed back up. Does PM+ store data anywhere except the cache file?

What happened the last time? Who knows. It still has two days of processing left according to PM, so I'll keep watching it.

We produce about 300-400k files a year. That number goes back nearly ten years, then drops off dramatically to 50-100k images a year for another ten years before that. So, what's that? Four million or so currently, with a steady 300k increase each year.
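Rough math, using the midpoints of those ranges (just a sanity check, the per-year counts are approximate):

Code:
# Back-of-the-envelope count, using the midpoints of the ranges above.
recent = 350_000 * 10  # roughly 300-400k files/year for the last ~10 years
older = 75_000 * 10    # roughly 50-100k files/year for the ~10 years before that
print(recent + older)  # about 4,250,000, in line with "4 million or so"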


Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #10 on: April 21, 2020, 09:24:27 AM »
Shane,

Quote from: shaneb on April 21, 2020, 07:07:15 AM
Overnight report... all programs were closed overnight except PM+. The cache, render, and other files from the cache library have remained reasonably sized at 150 MB. The catalog folder is up to 179 MB. When I first came out, the boot drive appeared to be 10 GB fuller than last night... but 20 minutes later that space had been freed back up. Does PM+ store data anywhere except the cache file?

I see.  You're calling the cache folder a file.  I was thinking that you found some huge mega file called 'cache' that PM was creating.  The cache folder contains various files: images, sounds, databases, logs, etc.  Perhaps one of the log files is getting large?  They're called PM.log and dyn.log and they're in the Photo Mechanic cache folder.

Quote from: shaneb on April 21, 2020, 07:07:15 AM
What happened the last time? Who knows. It still has two days of processing left according to PM, so I'll keep watching it.

Quote from: shaneb on April 21, 2020, 07:07:15 AM
We produce about 300-400k files a year. That number goes back nearly ten years, then drops off dramatically to 50-100k images a year for another ten years before that. So, what's that? Four million or so currently, with a steady 300k increase each year.

We haven't tested with four million files in a single catalog, though we have no defined limits.  I do expect that indexing would get somewhat slow once the number of items in the catalog went over one million files.  You could always have one catalog per year (or per five years) and have only the current catalog set to Add/Modify, with the other catalogs set to Search only.  Then you could search for anything but keep your current catalog fast.

Note: Collections require that images added to a collection first be members of the catalog.  So if you're making collections of previous years, you should temporarily set those catalogs to Add/Modify and remove Add/Modify from the others.  Otherwise, you'll be adding images across catalogs as you create your Collections.

-Kirk

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Re: Working with large datasets?
« Reply #11 on: April 21, 2020, 06:31:37 PM »
OK Kirk... here is the update. Hang with me; I tried to get as much info as possible. As you can see below, the cache folder has ballooned to 17 GB and rising. It's set to be limited to 2 GB.



This is what the current process window is showing, which makes me think PM is generating previews in the cache and then moving them to the catalog folder, but not clearing the cache... or at least not quickly enough, since it continues to climb.



This is the specific folder that is holding the large bulk of the data (17 GB out of 17.05 GB or so).



I've never totally understood (in the 20+ years I have been using it) exactly how PM uses the cache. But I do know it can be a problem occasionally, specifically with computers now coming with fairly small SSD drives. However, usually it's a minor inconvenience.

In previous tests with PM+, when the cache folder was on the boot drive and it filled the drive (even if the catalog was on another drive), PM+ failed to handle it gracefully, locked up, and had to be force quit. Then when I restarted, the catalog would never load again and had to be deleted and rebuilt from scratch.

I am sure there is some reason PM+ creates the catalog previews in one place and then moves them to another, but it's worth knowing that you have to have a ton of space for both the cache folder and the catalog folder in order to scan large sets of images. For lots of people using laptops (or hell, even my 2013 Mac Pro), 40, 50 or 60 GB on the boot drive can be hard to come by!
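If anyone else wants to keep an eye on this during a long scan, here's a rough sketch (again assuming Python 3) that logs the size of the cache and catalog folders every few minutes. The paths are placeholders; point them at wherever you've moved yours.

Code:
import os
import time

def folder_size_gb(path):
    """Sum the size of every file under 'path', in GB."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # files may be moved or deleted while the scan runs
    return total / 1e9

# Placeholder locations: point these at your actual PM+ cache and catalog folders.
watched = {
    "cache": "/Volumes/Scratch/PM Cache",
    "catalog": "/Volumes/Scratch/PM Catalogs",
}

while True:
    stamp = time.strftime("%H:%M:%S")
    report = ", ".join(f"{name}: {folder_size_gb(path):.1f} GB" for name, path in watched.items())
    print(f"[{stamp}] {report}")
    time.sleep(300)  # check every five minutes during a long scan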

Shane


Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #12 on: April 21, 2020, 07:13:59 PM »
Shane,

Thank you for the screen shots.  It is the proxy generation not deleting its temporary images.  If all of your images are always going to be available, then you can turn off Proxy Generation and save the temporary space and the Catalog folder space as well.  If you need to work with offline images (read-only support at this time), then you can just quit and relaunch Photo Mechanic Plus; the catalog-temp will be cleared and the proxy generation will continue where it left off.

We'll address this problem at some point.

-Kirk

Offline shaneb

  • Newcomer
  • *
  • Posts: 16
    • View Profile
Re: Working with large datasets?
« Reply #13 on: April 22, 2020, 07:28:25 AM »
Thanks Kirk. Really appreciate your help. It's not a huge issue for the moment, just good to know how it operates so you know what to expect as far as storage needs.

On the issue of the huge datasets, I think I'll create a catalog for each year so that I can check dates for correction and limit what the searches are covering. I have not tried multiple catalogs, but it seems fairly simple to click them on and off.

Lastly, on the issue of proxies: am I correct in thinking that building catalogs on a dedicated SSD with proxies would result in much faster searches than without proxies? I.e., the catalog would only ever access the SSD when searching and not touch the originals unless I opened them in an editing suite or made some other change. This would be especially pertinent in my situation, where the originals reside on a network volume.

Shane

Offline Kirk Baker

  • Senior Software Engineer
  • Camera Bits Staff
  • Superhero Member
  • *****
  • Posts: 25020
    • View Profile
    • Camera Bits, Inc.
Re: Working with large datasets?
« Reply #14 on: April 22, 2020, 09:06:09 AM »
Shane,

Quote from: shaneb on April 22, 2020, 07:28:25 AM
Lastly, on the issue of proxies: am I correct in thinking that building catalogs on a dedicated SSD with proxies would result in much faster searches than without proxies? I.e., the catalog would only ever access the SSD when searching and not touch the originals unless I opened them in an editing suite or made some other change. This would be especially pertinent in my situation, where the originals reside on a network volume.

If the originals are available, they will always be used.  So the only way the proxies will ever be accessed is if you take the drives that the images reside on and unmount or disconnect them.

As for search speed, the proxies are irrelevant.  The #1 thing that improves the speed of the database is a fast, local disk.  SSD is the best at this time.

-Kirk