
Metadata parsing failure in files from Photoshop CS2 (PROPOSED FIX INCLUDED IN 2.02)


DavidW
23rd of October 2006 (Mon), 11:50
Finally, I think I've nailed the problem with EE failing to read most metadata (shutter speed, aperture, time, date etc.) in files saved from Photoshop CS2.

The problem is somewhat similar to this bug (http://photography-on-the.net/forum/showthread.php?t=224121), for which Pekka has already incorporated the fix in 2.01. CS2 writes the XMP metadata in a slightly different format: instead of the element form <tag>data</tag>, it uses the attribute form tag="data" for exif:, tiff: and aux: values.
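To make the two formats concrete, here is a minimal sketch that reads one property from either form. The sample fragments, the camera name and the helper xmp_value() are made up for illustration and are not part of EE. It uses preg_match() throughout, because ereg() was removed in PHP 7.0.

```php
<?php
// Hypothetical sample fragments; real XMP packets carry many more properties.
$element_form   = '<rdf:Description><exif:FNumber>28/10</exif:FNumber></rdf:Description>';
$attribute_form = '<rdf:Description exif:FNumber="28/10" tiff:Model="ExampleCam 1"/>';

// Hypothetical helper: return one XMP property in either form, or "" if absent.
function xmp_value($xmp, $tag) {
    $quoted = preg_quote($tag, '/');
    // Element form: <tag>data</tag>
    if (preg_match('/<' . $quoted . '>([^<]*)<\/' . $quoted . '>/', $xmp, $m)) {
        return $m[1];
    }
    // Attribute form: tag="data" (XML also allows single quotes)
    if (preg_match('/\b' . $quoted . '=(["\'])(.*?)\1/', $xmp, $m)) {
        return $m[2];
    }
    return '';
}

echo xmp_value($element_form, 'exif:FNumber'), "\n";   // 28/10
echo xmp_value($attribute_form, 'exif:FNumber'), "\n"; // 28/10
echo xmp_value($attribute_form, 'tiff:Model'), "\n";   // ExampleCam 1
```

Capturing the value in a group means the caller gets the data without the surrounding tags, so no strip_tags() pass is needed in this sketch.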

I've attached a fixed version of ee_extract_xmp_data - if Pekka's original regex fails to produce a result, it uses ereg() against regexp2 to attempt to extract the data. I've tested it on my installation and it seems to work, though I regard it as somewhat scrappy programming.


There's an argument that a new parser should be written from scratch, but I needed this working today and I haven't got much time to mess, so it was a case of patching the code that was there.


Pekka - are you aware that the original code uses preg_match(), which is a PCRE function? PCRE is a PHP extension, and may not be installed - though I expect it is installed in most people's PHP. The code I've added uses ereg(), which doesn't require PCRE, but I haven't bothered to rewrite the original regexes in POSIX extended regex rather than Perl regex form so that I can switch the original set of tests to ereg().
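One way to avoid depending on a single regex engine is to test for each at run time. The sketch below is only an illustration: xmp_attr_match() is a hypothetical name, and it handles the attribute form alone. Note that ereg() was deprecated in PHP 5.3 and removed in PHP 7.0, so on current PHP only the preg_match() branch can run.

```php
<?php
// Hypothetical wrapper: extract tag="data" with whichever regex engine exists.
function xmp_attr_match($tag, $xmp) {
    if (function_exists('preg_match')) {
        // PCRE branch: the pattern needs delimiters and a quoted tag name.
        if (preg_match('/' . preg_quote($tag, '/') . '="([^"]+)"/', $xmp, $m)) {
            return $m[1];
        }
        return "";
    }
    if (function_exists('ereg')) {
        // POSIX extended branch: no delimiters; only reachable on old PHP builds.
        if (ereg($tag . '="([^"]+)"', $xmp, $m)) {
            return $m[1];
        }
        return "";
    }
    return ""; // no regex engine available
}

echo xmp_attr_match('exif:FNumber', '<rdf:Description exif:FNumber="28/10"/>'), "\n"; // 28/10
```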

The function below completely replaces the original version of function ee_extract_xmp_data() in SCRIPT_editor_functions.php - it's easier to cut and paste the whole function rather than try to list the edits as they are extensive.

function ee_extract_xmp_data ($filename,$printout=0) {

// very straightforward one-purpose utility function which
// reads image data and gets some EXIF data (what I needed) out from its XMP/XAP tags (by Adobe Photoshop/CS)
// returns an array with values
// code by Pekka Saarinen http://photography-on-the.net

// read the whole file into a string (same result as the output-buffered readfile())
$source = file_get_contents($filename);



$xmpdata_start = strpos($source,"<x:xmpmeta");
if ($xmpdata_start === FALSE) $xmpdata_start = strpos($source,"<x:xapmeta");
$xmpdata_end = strpos($source,"</x:xmpmeta>");
if ($xmpdata_end === FALSE) $xmpdata_end = strpos($source,"</x:xapmeta>");
$xmplenght = $xmpdata_end-$xmpdata_start;
$xmpdata = substr($source,$xmpdata_start,$xmplenght+12);

$xmp_parsed = array();

$regexps = array(
array("name" => "DC creator", "regexp" => "/<dc:creator>\s*<rdf:Seq>\s*<rdf:li>.+<\/rdf:li>\s*<\/rdf:Seq>\s*<\/dc:creator>/", "regexp2" => ""), // All dc: tags don't need a different regex
array("name" => "TIFF camera model", "regexp" => "/<tiff:Model>.+<\/tiff:Model>/", "regexp2" => "tiff:Model=\"([^\"]+)\""),
array("name" => "TIFF maker", "regexp" => "/<tiff:Make>.+<\/tiff:Make>/", "regexp2" => "tiff:Make=\"([^\"]+)\""),
array("name" => "EXIF exposure time", "regexp" => "/<exif:ExposureTime>.+<\/exif:ExposureTime>/", "regexp2" => "exif:ExposureTime=\"([^\"]+)\""),
array("name" => "EXIF shutterspeed value", "regexp" => "/<exif:ShutterSpeedValue>.+<\/exif:ShutterSpeedValue>/", "regexp2" => "exif:ShutterSpeedValue=\"([^\"]+)\""),
array("name" => "EXIF f number", "regexp" => "/<exif:FNumber>.+<\/exif:FNumber>/", "regexp2" => "exif:FNumber=\"([^\"]+)\""),
array("name" => "EXIF aperture value", "regexp" => "/<exif:ApertureValue>.+<\/exif:ApertureValue>/", "regexp2" => "exif:ApertureValue=\"([^\"]+)\""),
array("name" => "EXIF exposure program", "regexp" => "/<exif:ExposureProgram>.+<\/exif:ExposureProgram>/", "regexp2" => "exif:ExposureProgram=\"([^\"]+)\""),
array("name" => "EXIF iso speed ratings", "regexp" => "/<exif:ISOSpeedRatings>\s*<rdf:Seq>\s*<rdf:li>.+<\/rdf:li>\s*<\/rdf:Seq>\s*<\/exif:ISOSpeedRatings>/", "regexp2" => ""),
array("name" => "EXIF datetime original", "regexp" => "/<exif:DateTimeOriginal>.+<\/exif:DateTimeOriginal>/", "regexp2" => "exif:DateTimeOriginal=\"([^\"]+)\""),
array("name" => "EXIF exposure bias value", "regexp" => "/<exif:ExposureBiasValue>.+<\/exif:ExposureBiasValue>/", "regexp2" => "exif:ExposureBiasValue=\"([^\"]+)\""),
array("name" => "EXIF metering mode", "regexp" => "/<exif:MeteringMode>.+<\/exif:MeteringMode>/", "regexp2" => "exif:MeteringMode=\"([^\"]+)\""),
array("name" => "EXIF focal lenght", "regexp" => "/<exif:FocalLength\>.+\<\/exif:FocalLength>/", "regexp2" => "exif:FocalLength=\"([^\"]+)\""),
array("name" => "AUX lens", "regexp" => "/<aux:Lens>.+<\/aux:Lens>/", "regexp2" => "aux:Lens=\"([^\"]+)\""),
array("name" => "DC rights", "regexp" => "/<dc:rights>\s*<rdf:Alt>\s*<rdf:li xml:lang=['\"]x\-default['\"]>.+<\/rdf:li>\s*<\/rdf:Alt>\s*<\/dc:rights>/", "regexp2" => ""),
array("name" => "DC description", "regexp" => "/<dc:description>\s*<rdf:Alt>\s*<rdf:li xml:lang=['\"]x\-default['\"]>.+<\/rdf:li>\s*<\/rdf:Alt>\s*<\/dc:description>/", "regexp2" => ""),
array("name" => "DC title", "regexp" => "/<dc:title>\s*<rdf:Alt>\s*<rdf:li xml:lang=['\"]x\-default['\"]>.+<\/rdf:li>\s*<\/rdf:Alt>\s*<\/dc:title>/", "regexp2" => ""),
array("name" => "PHOTOSHOP headline", "regexp" => "/<photoshop:Headline>.+<\/photoshop:Headline>/", "regexp2" => "photoshop:Headline=\"([^\"]+)\""),
array("name" => "PHOTOSHOP city", "regexp" => "/<photoshop:City>.+<\/photoshop:City>/", "regexp2" => "photoshop:City=\"([^\"]+)\""),
array("name" => "PHOTOSHOP state", "regexp" => "/<photoshop:State>.+<\/photoshop:State>/", "regexp2" => "photoshop:State=\"([^\"]+)\""),
array("name" => "PHOTOSHOP country", "regexp" => "/<photoshop:Country>.+<\/photoshop:Country>/", "regexp2" => "photoshop:Country=\"([^\"]+)\""),
array("name" => "PHOTOSHOP category", "regexp" => "/<photoshop:Category>.+<\/photoshop:Category>/", "regexp2" => "photoshop:Category=\"([^\"]+)\""),
array("name" => "PHOTOSHOP credit", "regexp" => "/<photoshop:Credit>.+<\/photoshop:Credit>/", "regexp2" => "photoshop:Credit=\"([^\"]+)\""),
array("name" => "PHOTOSHOP authors position", "regexp" => "/<photoshop:AuthorsPosition>.+<\/photoshop:AuthorsPosition>/", "regexp2" => "photoshop:AuthorsPosition=\"([^\"]+)\"")
);




foreach ($regexps as $key => $k) {
$name = $k["name"];
$regexp = $k["regexp"];
$regexp2 = $k["regexp2"];
$xmp_item = "";
unset($r);
if (preg_match($regexp, $xmpdata, $r)) {
$xmp_item = @$r[0];
}
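elseif ($regexp2 != "") {
// No element-form match: fall back to CS2's attribute form (tag="data").
// Entries with an empty regexp2 are skipped, because ereg() raises a warning on an empty pattern.
unset($s);
if (ereg($regexp2, $xmpdata, $s)) {
$xmp_item = $s[1]; // [1] to retrieve bracketed expression from regex
}
}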
if ($name == "EXIF datetime original") {
$xmp_item = str_replace("Z","",$xmp_item);
$xmp_item = str_replace("T"," ",$xmp_item);
}
if ($name == "AUX lens") {
$xmp_item = str_replace(" ","",$xmp_item);
$xmp_item = str_replace("m","",$xmp_item);
}
array_push($xmp_parsed,array("item" => $name, "value" => strip_tags($xmp_item)));
}

$xmp_supplemental_categories = read_xmp_bag ($xmpdata,"<photoshop:SupplementalCategories>", "</photoshop:SupplementalCategories>");
array_push($xmp_parsed,array("item" => "PHOTOSHOP supplemental categories", "value" => $xmp_supplemental_categories));

$xmp_keywords = read_xmp_bag ($xmpdata,"<dc:subject>", "</dc:subject>");
array_push($xmp_parsed,array("item" => "DC keywords", "value" => $xmp_keywords));

//$xmp_author = read_xmp_bag ($xmpdata,"<dc:creator>", "</dc:creator>");
//array_push($xmp_parsed,array("item" => "DC creator", "value" => @$xmp_author[0]));


if ($printout != 0) {
foreach ($xmp_parsed as $key => $k) {
$item = $k["item"];
$value = $k["value"];
if (gettype($value) != "array") {
print "<br><span style=\"color: #990000;\"><b>" . $item . ":</b></span> " . $value;
} else {
print "<br><b>" . $item . ":</b> ";
ee_print_array($value);
}
}
}
$source = "";
return ($xmp_parsed);
}

Feedback would be appreciated from anyone. I've heard a few hints of metadata reading failure with EE2 - if all your photos are being read with a 'default' time and date (possibly in 2002), this could be your issue.

Pekka - if you're happy with the code from a security point of view, can this be incorporated into 2.02? It's heaps better than keying in all the metadata by hand!



David

reiger
28th of October 2006 (Sat), 04:15
So I've incorporated the new code into SCRIPT_editor_functions.php. Server Tools shows the XMP metadata is being read correctly. How would I take advantage of the XMP data to customize my output? How would I display, say, $xmp_credit on photo.php output?

DavidW
28th of October 2006 (Sat), 06:57
Can you confirm that, with this fix, EE is now parsing metadata in your photos that it previously couldn't? I'd like someone to underline that this is a valuable fix.


Most of the fields found by that code are read into the database when the photo is originally added - but I don't believe all of them are. View Combined Camera Data shows what EE is intending to use of the available metadata.

Some IPTC fields from the XMP data are used - the city / state / country fields are certainly used. I don't believe the headline, credit or author position fields are used - I don't make use of them in my workflow (I use title and description instead). Whether headline and credit are (or could be) used instead of title and description if title and/or description are not present, I don't know - that's in code I haven't yet looked at.
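For fields that EE does not store, the value can at least be picked out of the array that ee_extract_xmp_data() returns, since each entry is an "item"/"value" pair. The sketch below assumes that shape. xmp_lookup() is a hypothetical helper and the sample array is a stand-in for the real return value. How to wire the result into photo.php is not covered in this thread, so that part is left out.

```php
<?php
// Hypothetical helper: find one entry in the list returned by ee_extract_xmp_data().
function xmp_lookup($xmp_parsed, $item_name) {
    foreach ($xmp_parsed as $entry) {
        if ($entry["item"] == $item_name) {
            return $entry["value"];
        }
    }
    return ""; // item not present
}

// Stand-in for the real return value, using the same item/value shape.
$xmp_parsed = array(
    array("item" => "PHOTOSHOP credit", "value" => "Example Credit"),
    array("item" => "DC keywords", "value" => array("family", "1920s"))
);

$xmp_credit = xmp_lookup($xmp_parsed, "PHOTOSHOP credit");
echo htmlspecialchars($xmp_credit), "\n"; // escape before printing into a page
```

Bag-style items such as "DC keywords" come back as arrays, so a caller would need to check the type before printing them.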

I'd also like an easy way of re-reading metadata from an existing file on the server. This is useful when debugging the metadata reading code, and will be even more important if I get round to writing GPS support (http://photography-on-the.net/forum/showthread.php?t=224680). It's sort of there in the photo editor, but you can only read metadata in a file you upload, not from the files already available for that photo via the size path(s). Something of what is needed is in the function to read size paths - maybe that can be extended to re-read metadata. I need to think about the best way to implement that.


I do intend to do more work on EE's metadata handling in the future, not least towards adding GPS support. Amongst other tasks, the whole of ee_extract_xmp_data() really needs rewriting - it's very inefficient code, making numerous regular expression searches across the data. It does start out by identifying the fragment of the photo that contains the XMP or XAP data, and only searching that, but it's still a very inefficient way of parsing data.
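The fragment-first step can be sketched on its own as below. This is an illustration rather than the EE code: xmp_packet() is a hypothetical name, and unlike the current function it returns an empty string when no packet is found instead of slicing from a failed strpos().

```php
<?php
// Hypothetical helper: return the XMP/XAP packet embedded in $source, or "" if none.
function xmp_packet($source) {
    $wrappers = array(
        array("<x:xmpmeta", "</x:xmpmeta>"),
        array("<x:xapmeta", "</x:xapmeta>")
    );
    foreach ($wrappers as $w) {
        $start = strpos($source, $w[0]);
        if ($start === false) {
            continue; // this wrapper is not present, try the next one
        }
        $end = strpos($source, $w[1], $start);
        if ($end === false) {
            continue; // opening tag without a matching close
        }
        // Include the closing tag itself in the slice.
        return substr($source, $start, $end - $start + strlen($w[1]));
    }
    return "";
}

// A few binary-looking bytes around a tiny packet stand in for a real image file.
$source = "\xFF\xD8junk<x:xmpmeta><rdf:RDF/></x:xmpmeta>more\xFF\xD9";
echo xmp_packet($source), "\n"; // <x:xmpmeta><rdf:RDF/></x:xmpmeta>
```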

Efficiency isn't a huge concern here, as the code is only used by the management interface. However, the maintainability is somewhat suspect, and there is also the problem of the current code depending on PCRE.

Before adding GPS support, I hope to rewrite this function as a more conventional parser that parses on tokens and extracts the corresponding data. I can make the parser deal cleanly with the different data formats that can be found (<tag>data</tag>, tag="data", the "bag" format) as it's rewritten, which will make it easier to maintain and extend in the future. As part of this, I'd spend time reading the Adobe XMP standards, so that I can make sure the code is implemented in the most all-encompassing way possible.
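As a rough idea of what a single pass over all three formats could look like, here is a sketch that hands the packet to PHP's DOM extension rather than a hand-written tokenizer. That choice is an assumption: it needs PHP 5 or later with DOM enabled and a well-formed packet. xmp_flatten() and the sample values are hypothetical, and nested structures are not handled.

```php
<?php
// Hypothetical sketch: flatten an XMP packet into "prefix:name" => value pairs.
function xmp_flatten($xmp) {
    $rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    $doc = new DOMDocument();
    if ($xmp === "" || !@$doc->loadXML($xmp)) {
        return array(); // empty or not well-formed
    }
    $out = array();
    foreach ($doc->getElementsByTagNameNS($rdf, "Description") as $desc) {
        // Attribute form: tag="data" (skip rdf:about and namespace declarations)
        foreach ($desc->attributes as $attr) {
            if ($attr->prefix == "" || $attr->prefix == "rdf" || $attr->prefix == "xmlns") {
                continue;
            }
            $out[$attr->prefix . ":" . $attr->localName] = $attr->value;
        }
        foreach ($desc->childNodes as $child) {
            if ($child->nodeType != XML_ELEMENT_NODE) {
                continue; // whitespace between elements
            }
            $name  = $child->prefix . ":" . $child->localName;
            $items = $child->getElementsByTagNameNS($rdf, "li");
            if ($items->length == 0) {
                // Element form: <tag>data</tag>
                $out[$name] = trim($child->textContent);
            } else {
                // Container form (rdf:Bag, rdf:Seq, rdf:Alt): collect every rdf:li
                $values = array();
                foreach ($items as $li) {
                    $values[] = trim($li->textContent);
                }
                $out[$name] = $values;
            }
        }
    }
    return $out;
}

$xmp = <<<XML
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:exif="http://ns.adobe.com/exif/1.0/"
    xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exif:FNumber="28/10">
   <tiff:Make>ExampleCam</tiff:Make>
   <dc:subject>
    <rdf:Bag>
     <rdf:li>harbour</rdf:li>
     <rdf:li>sunset</rdf:li>
    </rdf:Bag>
   </dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
XML;

foreach (xmp_flatten($xmp) as $name => $value) {
    echo $name, " => ", is_array($value) ? implode(", ", $value) : $value, "\n";
}
```

Because the XML parser resolves the element and attribute forms itself, adding a new field needs no new regular expression, which is the maintainability point made above.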


At the moment, I'm not in a position to make any progress on this for at least a few weeks. Not only am I busy, there are a few prerequisites for this work that I still have to get working.

I still haven't got the FreeBSD DBG port updated to 2.15.1; unfortunately the author provides very limited support for the free version of DBG, and there's some kind of weird interaction between the 2.15.1 build system, FreeBSD's port building system and phpize that's tricking autoconf into doing something stupid and which I need to take time to understand. I can get DBG to compile if I don't phpize (DBG 2.15.1's build system doesn't use phpize), but I don't then know the correct location for dbg.so. If I phpize, the port fails to build. Until I can get debugging facilities on my development server, it's hard to make progress with the code. Whilst I could manually install DBG, I have a policy of only using ports to install software on FreeBSD for maintainability reasons.

I also have a weird problem with my router's firewall sometimes resetting TCP sockets between my workstation and my FreeBSD box. I've temporarily solved this by multihoming my workstation (the joys of tagged VLANs!), but I should really report it to the router vendor, which means spending time sitting here with Wireshark recording and documenting packet traces. On a similar note, I really need to write proper pf rules for my FreeBSD box (particularly as the free version of DBG doesn't support any restrictions on access - though the right answer there is probably either to write a patch to add access controls to the free version of DBG if the DBG licence permits that, or use another jail to run Apache and PHP including DBG which is only available on my LAN).


If Pekka or anyone else has time to look further at any of these metadata handling issues, I'd be interested to know what comes of your work. Understandably, most people's interests with changing, extending and modifying EE 2 are with styles at the moment, as they affect how your gallery is seen by the world. We're also in a 'melting pot' situation with browsers (IE 7 has just been released, and the final version of Firefox 2 is imminent if it's not already released).

However, I believe for EE to be truly useful, the metadata handling has to be robust, especially for users like me that make extensive use of metadata within our workflow. I don't key any titles or descriptions into EE - they all come from metadata. At least with the code in the post that started this thread, EE correctly picks up the metadata that is present in files saved from Photoshop CS2.



David

reiger
28th of October 2006 (Sat), 17:41
Can I confirm the parser is working for me when it wasn't before? No, sorry. All my photos are processed using CS2 before getting loaded into EE2. The original function ee_extract_xmp_data() in SCRIPT_editor_functions.php was working fine for me most if not all of the time. I had one exhibition in which, when sorted by date, a single photo was sorted incorrectly. It turned out that either EE failed to pull the correct time out of the photo and supplied its own value, or I somehow managed to edit the photo's time (something I never do). The photo's time reported through the admin photo editor was something like 333 or 888, which is odd since EE won't allow me to enter an integer for the time field and leave it formatted that way. It'll convert the integer to a time - usually 00:00:00.

Anyway, EE showed the photo to have a correct date and incorrect time. For some reason EE ignored the photo's date and sorted it to be the first photo in the exhibition when sorted by date (as if the photo had a date of 1/1/2006). Odd. When I noticed the problem I edited the photo's time in the admin photo editor. From then on EE correctly sorted the photo in the exhibition.

I tried to replicate the error by reloading the same photo into the EE db, but EE handled the photo fine the second time. I have lots of exhibitions. I'll check to see if any other photos have had problems like that.

I'm slowly working to add exhibitions for genealogy/historical photos to my site. I created a custom XMP file info panel to capture the additional genealogy data I needed for thousands of photos. Now I need to figure out how to pull out that additional metadata through photo.php for just some of my exhibitions...